Artificial intelligence (AI) and machine learning (ML) technologies are powering a rapidly expanding array of products and applications, from deeply embedded systems to hyperscale data center deployments. Although the hardware designs that support these applications vary widely, they all require hardware acceleration.
Deep learning techniques require vast numbers of tensor arithmetic operations. To support real-time execution, memory and processor performance must reach levels far beyond what standard software-driven architectures can deliver. This need has led to specialized hardware accelerator designs that perform tensor arithmetic in parallel and in deep pipelines. To keep those pipelines from stalling, data must be in the right place, at the right time, and in the right format. Dedicated data orchestration hardware prevents accelerator channel blocking, allowing the accelerator to operate at maximum efficiency.
Data orchestration covers the pre-processing and post-processing operations that ensure data is delivered to the machine learning engine at the optimal rate and in the format best suited to efficient processing. These operations range from resource management and usage planning, to I/O adaptation, transcoding, transformation, and sensor fusion, to data compression and rearrangement within shared storage arrays. How these capabilities are deployed depends on the performance and cost requirements of the target application, but for most scenarios a programmable logic platform optimized for data ingestion, transformation, and delivery provides the best data orchestration strategy for machine learning accelerators.
Introduction
Deep learning puts enormous pressure on computing hardware. The shift to dedicated accelerators gives chip technology a way to keep pace with AI developments, but the accelerators alone cannot meet the need for higher performance at lower cost.
Understandably, integrated circuit (IC) suppliers and systems companies have focused on the raw performance of their matrix and tensor processing arrays. At peak throughput, these architectures easily reach performance levels measured in trillions of operations per second (TOPS), even in systems designed for edge computing. However, the focus on peak TOPS risks leaving hardware underutilized whenever data is unavailable or must first be converted into the correct format for each model layer.
The system must compensate for network and storage latency and ensure that data elements are in the proper format and location as they pass in and out of the AI accelerator at a consistent rate. Data orchestration provides a way to guarantee that data is properly formatted and positioned on every clock cycle, maximizing system throughput.
Because typical AI implementations are complex, whether they reside in data centers, edge computing environments, or real-time embedded applications such as advanced driver assistance system (ADAS) designs, the data orchestration engine must handle many tasks, including the following (a simplified software sketch follows the list):
Data manipulation and reformatting
Scheduling and load balancing across multiple vector units
Packet inspection to detect data corruption, such as corruption caused by sensor failure
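To make these responsibilities concrete, below is a minimal Python sketch of such an orchestration loop. It is purely illustrative: the packet format, the CRC-based integrity check, and the per-engine queues are assumptions standing in for what would be fixed-function FPGA logic in a real design.

```python
import zlib
from collections import deque

# Hypothetical work queues: one per accelerator channel being fed.
engine_queues = [deque() for _ in range(4)]

def packet_is_valid(payload: bytes, expected_crc: int) -> bool:
    """Packet inspection: reject corrupted data (e.g. from a failing sensor)."""
    return zlib.crc32(payload) == expected_crc

def reformat(payload: bytes) -> bytes:
    """Data manipulation: placeholder for transcoding / layout conversion."""
    return payload  # real logic would repack the data for the accelerator

def dispatch(payload: bytes, crc: int) -> None:
    """Scheduling and load balancing: send work to the least-loaded engine."""
    if not packet_is_valid(payload, crc):
        return  # drop or flag the corrupted sample rather than stall the pipeline
    target = min(engine_queues, key=len)
    target.append(reformat(payload))
```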
Although these functions could be achieved by adding data control and exception handling hardware to the core processing array, the wide variety of operations that may be required, and the growing need for flexibility as AI models evolve, make hard-wiring these functions into the core accelerator chip an expensive and short-lived option. For example, in some application environments encryption support is rapidly becoming a requirement for data security, but different levels of encryption may be used depending on the sensitivity of the data at each layer of the application. Fixed-architecture solutions risk being unable to adapt to such changing needs.
One possible approach is to use a programmable microprocessor to control the flow of data through the accelerator. The problem with this approach is that software execution simply cannot keep pace with the accelerator hardware. A more hardware-centric data orchestration approach lets the accelerator design focus entirely on core pipeline efficiency: external data orchestration handles all storage and I/O management, ensuring an uninterrupted flow of operands and weights. Since the data orchestration engine must absorb revisions and changes to the application and model design, hard-wired logic is not appropriate here either. Programmable logic supports modification and avoids the risk of the data orchestration engine becoming impossible to update.
In principle, field-programmable gate arrays (FPGAs) combine distributed memory, arithmetic units, and look-up tables to provide the combinatorial capability that is ideal for the real-time reorganization, remapping, and memory management of streaming data required by AI-driven applications. FPGAs enable the creation of custom hardware circuits that support the intensive data flow of deeply pipelined AI accelerators, while letting users change the implementation as needed to accommodate new architectures. However, the performance requirements of data orchestration call for a new approach to FPGA design.
Application Scenarios for Data Orchestration
Data orchestration architectures take many forms across application scenarios such as data centers, edge computing, and embedded system deployments. For example, in a data center environment, multiple accelerators can be deployed to run a single model, with their data throughput managed by one or more data orchestration engines.
Inference systems require data orchestration to keep every worker engine fully utilized, avoiding bottlenecks and ensuring that incoming data samples are processed as quickly as possible. Distributed training adds the requirement for fast updates of neuron weights, which must be distributed to the other worker engines processing related parts of the model as quickly as possible to avoid stalls.
Data orchestration logic in the FPGA can handle a wide range of weight distribution and synchronization protocols, supporting efficient operation while relieving the accelerator itself of the data organization burden. The diagram below shows one possible implementation, in which a single FPGA device manages multiple AI engines on the same board. With a suitable communication protocol, each machine learning application-specific integrated circuit (ASIC) no longer needs its own memory controller. Instead, the data orchestration engine organizes all weights and data elements in local memory and simply transfers them, in the proper order, to each ASIC it manages. The result is high performance at lower overall cost, achieved by reducing duplicated storage and interface logic.
Figure 1: Data orchestration can quickly provide load balancing and other data forwarding functions for parallelized AI-enabled applications
Data orchestration hardware can further improve performance without increasing cost. One option is to compress data crossing the network or system bus, avoiding the need for more expensive interconnects. The logic-level programmability of the FPGA supports compression and decompression of data passing over a network interface. The data orchestration hardware can also apply forward error correction protocols to ensure that valid data keeps flowing at full pipeline speed. Corruption events are rare in most designs, but without external error correction support, recovery is costly for a highly pipelined accelerator.
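As an illustration of the compression idea, the following sketch uses Python's standard zlib library to compress a tensor buffer before it crosses the link and to restore it on the other side. The function names and the choice of zlib (rather than a hardware-friendly codec) are illustrative assumptions only.

```python
import zlib
import numpy as np

def compress_for_link(weights: np.ndarray, level: int = 6) -> bytes:
    """Compress a weight/activation buffer before sending it over the link."""
    return zlib.compress(weights.tobytes(), level)

def decompress_from_link(blob: bytes, dtype, shape) -> np.ndarray:
    """Restore the original buffer on the receiving side."""
    return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

# Example: an int8 activation tile with many repeated values compresses well.
tile = np.zeros((256, 256), dtype=np.int8)
blob = compress_for_link(tile)
restored = decompress_from_link(blob, np.int8, tile.shape)
assert np.array_equal(tile, restored)
```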
Figure 2 shows the various ways the data orchestration engine can optimize the flow and presentation of data delivered to the machine learning engine. For example, the format and structure of individual data elements present an important opportunity, since source data must often be re-represented in a form suitable for feature extraction by deep neural networks (DNNs).
In image recognition and classification applications, pixel data is often channelized so that each color plane can be processed individually before the results are aggregated through pooling layers that extract shape and other high-level information. Channelization helps identify edges and other features that may not be easy to detect in a combined RGB representation. A wider range of transformations is performed in speech and language processing, where data is usually mapped into a form that is easier for DNNs to process. Rather than processing ASCII or Unicode characters directly, the words and subwords handled by the model are converted into vector or one-hot representations. Similarly, speech data may not be presented as raw time-domain samples but transformed into a joint time-frequency representation, making important features easier for the early DNN layers to identify.
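The sketch below illustrates two of these transformations in plain Python/NumPy: splitting an interleaved RGB image into separate color planes, and converting tokens into one-hot vectors. The helper names and the tiny vocabulary are hypothetical; in a data orchestration engine these steps would run in programmable logic at line rate.

```python
import numpy as np

def split_planes(rgb: np.ndarray) -> np.ndarray:
    """Channelization: turn an H x W x 3 image into a 3 x H x W array so each
    color plane can be processed individually."""
    return np.transpose(rgb, (2, 0, 1))

def one_hot(tokens, vocab):
    """Map words/subwords to one-hot vectors instead of raw character codes.
    `vocab` is a hypothetical {word: index} dictionary built elsewhere."""
    out = np.zeros((len(tokens), len(vocab)), dtype=np.float32)
    for row, tok in enumerate(tokens):
        out[row, vocab[tok]] = 1.0
    return out

planes = split_planes(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
vectors = one_hot(["stop", "sign"], {"stop": 0, "sign": 1, "go": 2})
```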
Although these data transformations could be performed by the arithmetic cores in the AI accelerator, such reformatting work is poorly suited to tensor engines and maps naturally onto FPGA-based modules. FPGAs can perform the conversion efficiently at line rate, without the latency associated with running software on general-purpose processors.
In real-time and embedded applications involving sensors, preprocessing the data brings additional benefits. For example, while a DNN can be trained to tolerate noise and changes in environmental conditions, using front-end signal processing to denoise or normalize the data improves reliability. In automotive advanced driver assistance system (ADAS) implementations, camera systems must handle changing lighting conditions. Typically, the sensor's high dynamic range can be exploited through brightness and contrast adjustments, and the FPGA can perform the necessary operations to provide the DNN with a less variable stream of pixels.
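As a simple example of such front-end conditioning, the following NumPy sketch stretches a camera frame's brightness into a fixed range; the percentile-based approach and the threshold values are illustrative choices, not a prescribed algorithm.

```python
import numpy as np

def normalize_frame(frame: np.ndarray) -> np.ndarray:
    """Stretch a camera frame's brightness range to [0, 1] so the DNN sees a
    less variable pixel distribution across lighting conditions."""
    frame = frame.astype(np.float32)
    lo, hi = np.percentile(frame, (1, 99))   # clip extremes for robustness
    return np.clip((frame - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
```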
Sensor fusion is an increasingly important aspect of ADAS design, helping to improve the performance of end systems. Because environmental conditions can make individual sensor data difficult to interpret, AI models must efficiently take input from many different types of sensors, including cameras, lidar, and radar.
Format conversion is critical here. For example, lidar provides depth information for target objects in Cartesian space, while radar operates in a polar coordinate system. Many models make sensor fusion easier by transforming one coordinate space into the other. Similarly, image data from multiple cameras must be stitched together and transformed using projections to deliver the most useful information to the AI model.
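A minimal sketch of the polar-to-Cartesian step, assuming radar detections arrive as (range, azimuth) pairs and the lidar frame is a simple 2D x/y plane:

```python
import numpy as np

def radar_to_cartesian(rng: np.ndarray, azimuth_rad: np.ndarray) -> np.ndarray:
    """Convert radar detections from (range, azimuth) polar coordinates into
    the x/y Cartesian frame used by the lidar point cloud, so downstream
    fusion layers receive both sensors in one coordinate space."""
    x = rng * np.cos(azimuth_rad)
    y = rng * np.sin(azimuth_rad)
    return np.stack([x, y], axis=-1)

points = radar_to_cartesian(np.array([10.0, 25.0]), np.radians([0.0, 45.0]))
```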
Lower-level transformations are also required. Automotive original equipment manufacturers (OEMs) purchase sensor modules from different suppliers, each of which interprets the connectivity standards in its own way. Some function is therefore needed to parse the data packets these sensors send over the in-vehicle network and convert them into a standard format the DNN can handle. For security reasons, each module must also authenticate itself to the ADAS unit and, in some cases, send encrypted data. A data orchestration chip can offload these decryption and format conversion functions from the AI accelerator engine.
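The sketch below shows the kind of packet normalization involved, using Python's struct module. The header layout, field names, and sensor types are entirely hypothetical; every supplier's real framing would differ, which is precisely why this parsing step is needed.

```python
import struct

# Hypothetical layout: sensor_id, type, flags, timestamp_ms (little-endian).
HEADER = struct.Struct("<HBBI")

def parse_sensor_packet(packet: bytes) -> dict:
    """Unpack one vendor-specific packet into a vendor-neutral record that
    the downstream DNN pipeline can consume."""
    sensor_id, sensor_type, flags, timestamp = HEADER.unpack_from(packet, 0)
    payload = packet[HEADER.size:]
    return {"sensor_id": sensor_id, "type": sensor_type,
            "timestamp_ms": timestamp, "payload": payload}

example = HEADER.pack(7, 2, 0, 123456) + b"\x01\x02\x03"
record = parse_sensor_packet(example)
```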
Further optimization can be achieved by removing unnecessary data with front-end signal processing functions implemented in the data orchestration subsystem. For example, functions that process input from microphones and other 1D sensors can discard periods of silence or low-level background noise, and the number of video frames delivered can be reduced while the vehicle is stationary, lowering the load on the AI engine.
Figure 2: Data Orchestration Offers Multiple Options for Accelerating AI Functions
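A simplified NumPy sketch of this kind of front-end gating appears below; the energy threshold, frame-difference metric, and tolerance values are arbitrary placeholders for parameters that would be tuned per application.

```python
import numpy as np

def drop_silent_frames(frames: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Keep only audio frames whose RMS energy exceeds a (tunable) threshold,
    so silence and low-level background noise never reach the AI engine."""
    energy = np.sqrt(np.mean(frames.astype(np.float32) ** 2, axis=1))
    return frames[energy > threshold]

def frame_changed(prev: np.ndarray, cur: np.ndarray, tol: float = 2.0) -> bool:
    """Forward a video frame only if it differs meaningfully from the last
    one, e.g. while the vehicle is stationary most frames can be skipped."""
    diff = np.abs(cur.astype(np.int16) - prev.astype(np.int16))
    return float(np.mean(diff)) > tol
```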
An Architecture Optimized for Data Orchestration
While the combination of configurable interconnect and programmable logic within an FPGA lends itself to data orchestration tasks, FPGA architectures differ fundamentally, and how they handle high-bandwidth data is key. Traditionally, FPGAs were not expected to sit in the core of the data path; they primarily provided control-plane assistance to processors that interact with storage and I/O. Data orchestration requires cores that receive, transform, and manage data elements on behalf of processors and accelerators, which puts enormous pressure on traditional FPGA architectures.
To support the bandwidth requirements of data orchestration, traditional FPGAs need extremely wide buses to handle multiple data streams arriving over PCI Express and high-speed Ethernet interfaces. For example, to carry Ethernet traffic in excess of 400 Gb/s, designers must use the programmable interconnect to route a bus approximately 2048 bits wide and still reliably meet timing, which typically means clock frequencies of several hundred megahertz. Such wide buses are very difficult to route, creating congestion and timing closure problems, and the interconnect can consume hundreds of thousands of look-up tables (LUTs) that are then unavailable for data orchestration or format conversion tasks.
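As a back-of-the-envelope check on those figures (ignoring protocol framing and overhead):

```python
# A 400 Gb/s stream carried on a 2048-bit-wide internal bus needs a clock of
# roughly 400e9 / 2048 ≈ 195 MHz, i.e. on the order of a few hundred
# megahertz once framing overhead and design margin are included.
line_rate_bps = 400e9
bus_width_bits = 2048
required_clock_hz = line_rate_bps / bus_width_bits
print(f"{required_clock_hz / 1e6:.0f} MHz")   # ~195 MHz
```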
The Achronix Speedster7t family of FPGA devices overcomes these problems, in part by employing a specialized two-dimensional network-on-chip (2D NoC) with a total bandwidth of up to 20 Tb/s. Relative to the FPGA fabric interconnect, the 2D NoC not only provides a huge speed boost, it can also transfer large amounts of data at high rates between multiple PCIe Gen5 ports, 400 Gbps Ethernet ports, and GDDR6 memory interfaces without consuming any FPGA programmable resources.
In Speedster7t FPGA devices, the 2D NoC provides an interconnect fabric across the entire surface of the FPGA, using dedicated network access points (NAPs) to deliver packets to soft cores anywhere within the device. Each NAP provides access to programmable logic blocks or hardware resources through an industry-standard AXI port structure. Separate NAPs serve east-west and north-south data flows, providing additional flexibility and performance for logic that accesses the 2D NoC. This directional partitioning helps optimize transfers that start and end on the same NoC path, while routing across an orthogonal NoC path adds only a small, deterministic delay.
An important feature of the 2D NoC is packet mode, designed to make it easy to split data arriving at high-bandwidth ports such as Ethernet into multiple streams. Packet mode can separate packets arriving at a 200 Gb/s or 400 Gb/s Ethernet port and deliver them to different soft cores. This packet separation is shown in the figure below, where successive packets are distributed to different parts of the FPGA. Packet mode therefore makes it easy to create load-balancing architectures that are difficult to achieve with traditional FPGAs.
Figure 3: The packet mode of the network-on-chip enables automatic distribution of network payloads to different parts of the fabric
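The following toy Python model mimics the behavior shown in Figure 3, handing successive packets to different soft cores in round-robin fashion. In the Speedster7t device this distribution is performed by the 2D NoC hardware rather than software, so the code is only a conceptual analogy.

```python
from itertools import cycle

def distribute(packets, num_cores: int = 4):
    """Model of packet mode: successive packets from one high-bandwidth port
    are handed to different soft cores in turn (round-robin)."""
    bins = [[] for _ in range(num_cores)]
    for target, pkt in zip(cycle(range(num_cores)), packets):
        bins[target].append(pkt)
    return bins

streams = distribute([f"pkt{i}" for i in range(8)])  # -> 4 streams of 2 packets
```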
Another benefit is that the 2D NoC makes partial reconfiguration easier to support: each logic block in the 2D array can act as an isolatable resource that can be swapped for a new function without affecting any other block. This capability is further enhanced by the virtualization and translation logic implemented in the 2D NoC and its access-point controllers.
The address translation tables play a role similar to the memory management unit in a microprocessor, preventing the data belonging to different tasks from interfering with one another. Address translation in the access points means that each soft core can use the same virtual address range yet access completely different ranges of external physical memory. Access protection bits provide further security, preventing a core from reaching protected address ranges. This level of protection is likely to become extremely important in AI-based applications in which data orchestration and other programmable logic functions are implemented by different teams before being integrated into the final product.
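The following Python sketch models the idea of per-core address translation with protection bits. The page size, table structure, and exception behavior are assumptions for illustration and do not describe the actual Speedster7t implementation.

```python
PAGE_SIZE = 4096  # illustrative page granularity

class TranslationTable:
    """Toy per-core address translation table with a write-protect bit."""
    def __init__(self):
        self.entries = {}   # virtual page -> (physical page, writable)

    def map(self, vpage: int, ppage: int, writable: bool = True) -> None:
        self.entries[vpage] = (ppage, writable)

    def translate(self, vaddr: int, write: bool = False) -> int:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.entries:
            raise PermissionError("access to unmapped address blocked")
        ppage, writable = self.entries[vpage]
        if write and not writable:
            raise PermissionError("write to protected range blocked")
        return ppage * PAGE_SIZE + offset

# Two soft cores can use the same virtual range yet land in disjoint physical memory.
core_a, core_b = TranslationTable(), TranslationTable()
core_a.map(0, 100)
core_b.map(0, 200)
assert core_a.translate(64) != core_b.translate(64)
```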
Beyond highly flexible data routing, data orchestration requires fast arithmetic functions to complement the core AI accelerator. The Speedster7t FPGA deploys an array of machine learning processor (MLP) blocks. Each MLP is a highly configurable, compute-intensive block containing up to 32 multipliers, and together the MLPs deliver up to 60 TOPS of performance. The MLPs support integer formats from 4 to 24 bits and various floating-point modes, including bfloat16 and block floating-point (BFP) formats that directly support TensorFlow. The surrounding programmable logic fabric provides multiple ways to optimize data flow and take full advantage of the data reuse and throughput opportunities the MLPs offer.
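To illustrate what a block floating-point (BFP) format does, the sketch below encodes a block of values with one shared exponent and per-value integer mantissas. The 8-bit mantissa width and the rounding scheme are arbitrary illustrative choices, not the MLP's actual number formats.

```python
import numpy as np

def to_block_floating_point(block: np.ndarray, mantissa_bits: int = 8):
    """Encode a block of values with one shared exponent and per-value
    integer mantissas -- the basic idea behind BFP formats."""
    max_exp = int(np.ceil(np.log2(np.max(np.abs(block)) + 1e-30)))
    scale = 2.0 ** (max_exp - (mantissa_bits - 1))
    mantissas = np.clip(np.round(block / scale),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1).astype(np.int16)
    return mantissas, max_exp

def from_block_floating_point(mantissas, max_exp, mantissa_bits: int = 8):
    """Reconstruct approximate floating-point values from the BFP block."""
    return mantissas.astype(np.float32) * 2.0 ** (max_exp - (mantissa_bits - 1))

m, e = to_block_floating_point(np.array([0.5, -1.25, 3.0], dtype=np.float32))
approx = from_block_floating_point(m, e)
```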
Since data orchestration hardware must suit a variety of application environments, flexible deployment options are clearly needed. Data center applications may call for one or more discrete, high-capacity devices (such as Speedster7t FPGAs) to route and preprocess data streams for multiple machine learning engines on a single board, or distributed within a tray or rack. For edge computing applications where size, power consumption, and cost are the main constraints, there is a clear argument for a system-on-chip (SoC) solution.
Achronix is the only company that offers both standalone FPGA chips and embedded FPGA (eFPGA) semiconductor intellectual property (IP), and is therefore uniquely positioned to support cost-reduction programs in which the programmable logic and interconnect functions are integrated into an SoC, as shown in the figure below. Speedcore eFPGA IP uses the same technology as the Speedster7t FPGA, enabling a seamless transition from a Speedster7t FPGA to an ASIC with integrated Speedcore blocks. When converting Speedster7t FPGA designs to ASICs using Speedcore IP, customers can expect up to 50% lower power consumption and up to 90% lower unit cost.
Another option is to use chiplets in a multi-chip module, which provides a high-speed interconnect between a co-packaged FPGA-based data orchestration die and the machine learning engine. Achronix supports all of these implementation options.
Figure 4: Embedded FPGA technology enables data orchestration to be integrated into accelerator chips
Conclusion
The rapid development of deep learning has put enormous pressure on the hardware architectures required to implement the technology at scale. While the industry has focused heavily on peak TOPS scores in recognition that raw performance is an absolute requirement, intelligent data orchestration and management strategies provide a path to cost-effective and energy-efficient systems.
Data orchestration includes many pre- and post-processing operations, ensuring that data is delivered to the machine learning engine at optimal speed and in the format most suitable for efficient processing. Operations range from resource management and usage planning, to I/O adaptation, transcoding, transformation, and sensor fusion, to data compression and rearrangement within shared storage arrays. Some orchestration engines use subsets of these capabilities based on the core requirements of the target machine learning architecture.
The Achronix Speedster7t FPGA fabric provides a highly flexible platform for these data orchestration strategies. It is characterized by high throughput, low latency, and extreme flexibility, and its flexible data-movement capabilities allow even highly specialized accelerators to adapt to changing needs. In addition, the Speedster7t FPGA's extensive logic and arithmetic capabilities, coupled with its high-throughput interconnect, allow front-end signal conditioning and back-end machine learning to be designed together to maximize overall efficiency.