Data management is one of the most important aspects of designing an end-to-end AI solution, alongside factors like your AI infrastructure stack and AI algorithms. Data for AI can be multimodal, ranging from small files like documents to very large objects like videos, and storage systems need to be able to handle it all. In an AI pipeline, multiple types of storage workloads exist, with different capacity, latency and throughput requirements depending on the phase of the pipeline.
In this first part of our guide to storage for AI, we’ll dive into the storage types needed across an AI workflow. In future posts, we’ll cover considerations for choosing the right locations for hosting the different types of storage in the AI pipeline with respect to performance, cost, availability and privacy.
Types of storage for each phase of an AI pipeline
For a typical enterprise AI initiative, an end-to-end AI workflow includes several phases, such as data ingestion, data processing and curation, AI model training and inferencing, storing output metadata in data warehouses and databases, and deployment to deliver outcomes via dashboards, reports and corrective workflows.
Figure 1: AI workflow
At different points in this process, the data used for AI can shrink or grow relative to the previous phase. It’s crucial to consider not only the capacity requirements but also the performance and cost of the different storage options.
Let’s look at the data types and storage implications for each phase in a typical AI pipeline. Data volumes vary widely depending on the use case, so the numbers shown here are only for illustrative purposes. We’ve categorized storage types along two dimensions:
- Performance or capacity optimized
- File or object protocol
Today, most capacity-optimized storage systems are object based, while high-performance storage systems for AI are mostly file based. Performance-optimized flash media is used for high-performance storage systems, and capacity-optimized flash and disk storage are used for capacity-optimized storage due to their lower cost. Tape is used for archival storage.
Figure 2: Data types and storage in the AI pipeline
Raw input data
Size: Tens of petabytes (use case dependent)
Type of storage: Capacity-based file or object storage
Considerations: The raw data used for AI development is vast and spread across multiple clouds, private data centers, data brokers and devices at the edge. Most companies use more than 20 data sources to inform their AI.[1] For multimodal data, it’s beneficial to process the raw input data at the edge instead of backhauling it to core data centers. During data preparation, CPUs do most of the work of cleaning your raw data (removing faulty records, anonymizing, aggregating, etc.) for AI model training. Typical workloads see a roughly 10:1 data reduction at this stage from raw to cleaned data.
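To illustrate, here is a minimal sketch of that kind of CPU-side cleanup using pandas; the file names, column names and anonymization scheme are all hypothetical.

```python
import hashlib

import pandas as pd

# Hypothetical raw event export; column names are illustrative only.
raw = pd.read_csv("raw_events.csv")

# Remove faulty rows: drop records with missing timestamps or out-of-range readings.
clean = raw.dropna(subset=["timestamp", "reading"])
clean = clean[clean["reading"].between(0, 1000)].copy()

# Anonymize: replace the user identifier with a one-way hash.
clean["user_id"] = clean["user_id"].map(
    lambda uid: hashlib.sha256(str(uid).encode()).hexdigest()[:16]
)

# Aggregate: hourly averages per device shrink the data footprint substantially.
clean["hour"] = pd.to_datetime(clean["timestamp"]).dt.floor("h")
curated = clean.groupby(["device_id", "hour"], as_index=False)["reading"].mean()

curated.to_parquet("curated_events.parquet")  # curated data, ready for model training
```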
Refined AI training data
Size: Petabytes (use case dependent)
Type of storage: High-performance file storage
Considerations: Typically, high-performance file storage systems are used to store and access cleaned data for model training. These storage systems can be connected via either Gigabit Ethernet (GbE) or InfiniBand. The GbE standard currently supports higher bandwidth than InfiniBand, whereas InfiniBand offers lower latency. For large training jobs, many organizations prefer InfiniBand networks for guaranteed performance SLAs, whereas for AI inferencing they prefer GbE networks due to their lower cost and the pervasive nature of Ethernet.
Training checkpoints
Size: Terabytes
Type of storage: High-performance file storage for recent checkpoints; capacity-based file or object storage for older checkpoints
Considerations: AI training checkpoints are snapshots of an AI model’s state at some point in the training process. Checkpoints are used to save an AI model at specific intervals and resume training from there if the training job gets interrupted or a problem surfaces. Your most recent checkpoints should be in high-performance storage because you want to be able to quickly recover in case of failures or outages. But high-performance storage isn’t necessary for older training checkpoints because there’s a lower probability that you’ll need to access them. Checkpoint data size is typically two to three times the model size because a checkpoint also contains additional metadata, such as optimizer state.
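As an illustrative sketch of this pattern (assuming a PyTorch-style training loop; the paths, retention count and file layout are hypothetical), recent checkpoints stay on fast storage while older ones are demoted to capacity storage:

```python
import shutil
from pathlib import Path

import torch

FAST = Path("/mnt/fast-nfs/checkpoints")         # high-performance file storage (illustrative mount)
ARCHIVE = Path("/mnt/capacity/checkpoints")      # capacity-based storage (illustrative mount)
KEEP_RECENT = 3                                  # how many checkpoints stay on fast storage

def save_checkpoint(step, model, optimizer):
    FAST.mkdir(parents=True, exist_ok=True)
    path = FAST / f"step_{step:08d}.pt"
    # Model weights plus optimizer state is why checkpoints outgrow the model itself.
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        path,
    )
    # Demote all but the most recent checkpoints to cheaper capacity storage.
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for old in sorted(FAST.glob("step_*.pt"))[:-KEEP_RECENT]:
        shutil.move(str(old), ARCHIVE / old.name)

def resume(model, optimizer):
    # Restart from the newest checkpoint on fast storage after an interruption.
    latest = max(FAST.glob("step_*.pt"), default=None)
    if latest is None:
        return 0
    state = torch.load(latest)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```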
Trained AI model
Size: Megabytes to terabytes
Type of storage: High-performance file storage
Considerations: Trained AI models for enterprise use cases typically range from megabytes to terabytes; note, however, that generative AI models are larger than traditional AI models. The model weight representation (such as floating point or integer) and the number of model parameters determine the size of the trained model. When progressing from training data to a trained AI model, companies usually see a 1000:1 data reduction.
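As a back-of-the-envelope illustration (the parameter counts and precisions below are hypothetical examples), a model’s weight footprint is roughly its parameter count times the bytes per weight:

```python
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}

def model_size_gb(num_parameters: int, precision: str) -> float:
    """Approximate size of the weights alone (excludes optimizer state)."""
    return num_parameters * BYTES_PER_WEIGHT[precision] / 1e9

# A 25M-parameter vision model in fp32 is tens of megabytes...
print(f"{model_size_gb(25_000_000, 'fp32'):.2f} GB")      # ~0.10 GB

# ...while a 70B-parameter generative model in fp16 is over a hundred gigabytes.
print(f"{model_size_gb(70_000_000_000, 'fp16'):.0f} GB")  # ~140 GB
```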
Quantized model
Size: Megabytes to gigabytes
Type of storage: High-performance file or object storage
Considerations: A quantized AI model has been compressed, typically by converting its weights from floating point to lower-precision integer formats, to reduce the model’s size and computational requirements. Quantization enables faster inference and lower memory usage. Enterprises deploy the quantized model and feed it real data. In moving from a trained model to a quantized model, companies typically see roughly a 2:1 to 8:1 size reduction, depending on the target precision. Quantized models are often held entirely in memory to improve inference response times.
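Here is a minimal sketch of the idea, using simple symmetric int8 quantization of a weight matrix in NumPy; real deployments would typically rely on a framework’s quantization toolkit, and the tensor shape is illustrative:

```python
import numpy as np

# Hypothetical fp32 weight matrix from a trained model.
weights_fp32 = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric int8 quantization: one scale factor maps floats onto [-127, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

# Dequantize at inference time (or use integer kernels directly).
weights_restored = weights_int8.astype(np.float32) * scale

print(f"fp32: {weights_fp32.nbytes / 1e6:.0f} MB")  # ~67 MB
print(f"int8: {weights_int8.nbytes / 1e6:.0f} MB")  # ~17 MB, a 4:1 reduction
```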
Alerts metadata
Size: Terabytes (over time)
Type of storage: High-performance file or object storage
Considerations: AI alerts (and associated metadata) are generated as the output of AI inferencing. It’s important to capture and analyze this metadata to understand the behavior of your AI model and take action as needed.
Vector database for RAG inferencing
Size: Petabytes (depending on the size of document stores, number of dimensions indexed)
Type of storage: High-performance file and capacity-based object storage
Considerations: Retrieval-Augmented Generation (RAG) is an AI technique used to make large language model (LLM) results more accurate by providing additional contextual information in the input prompts. Vector databases are a crucial component of RAG inferencing. Companies encode documents as embeddings and store them in vector databases. Moving from raw data to a vector database can result in a 3x (or greater) increase in data volume, because the number of dimensions determines the number of indices and the required storage size.
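As a rough sizing illustration (the chunk count, embedding dimension and index-overhead factor are assumptions), the footprint of a vector store grows with the number of encoded chunks and the number of dimensions:

```python
def vector_store_size_gb(num_chunks: int, dims: int, bytes_per_value: int = 4,
                         index_overhead: float = 1.5) -> float:
    """Raw embedding size plus an assumed multiplier for index structures and metadata."""
    return num_chunks * dims * bytes_per_value * index_overhead / 1e9

# 100 million document chunks embedded at 1,536 dimensions in fp32:
print(f"{vector_store_size_gb(100_000_000, 1536):.0f} GB")  # ~922 GB before replication
```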
Inference and query logs for compliance
Size: Petabytes (function of number of users and the nature of their multimodal queries)
Type of storage: Capacity-based object storage
Considerations: Increasingly, due to regulatory concerns such as the EU AI Act,[2] organizations must log the input prompts and output responses of their generative AI queries. In many cases, these logs must be transferred from edge locations to centralized archival locations and stored for a regulated period of time. Capacity-focused object storage is recommended.
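A minimal sketch of that pattern, with illustrative record fields and retention period, appends prompt/response records as JSON Lines that a separate process would later ship to centralized, capacity-based object storage:

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365 * 3)  # illustrative regulated retention period

def log_query(prompt: str, response: str, model_id: str,
              log_path: str = "inference_log.jsonl") -> None:
    now = datetime.now(timezone.utc)
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": now.isoformat(),
        "retain_until": (now + RETENTION).isoformat(),
        "model_id": model_id,
        "prompt": prompt,
        "response": response,
    }
    # Append locally at the edge; a separate job ships closed log files
    # to centralized object storage for archival.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```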
Think about your storage strategy for AI
Data is the lifeblood of many organizations, and AI is creating new opportunities to use it for competitive advantage. But choosing the right underlying storage for each AI phase is an important and often overlooked element of preparing for AI. When choosing where to put your AI infrastructure, you need to look holistically at the end-to-end AI pipeline and consider the different storage requirements for each phase. Your entire data management strategy is relevant to your success with AI.
Now that we’ve identified the storage requirements of each AI phase, in part 2 of this guide we’ll discuss the optimal place for executing the different phases of the AI pipeline. That is, where should the AI infrastructure and the associated storage be hosted for different use cases: in the cloud, in a cloud-adjacent location like Equinix, or on-premises? We’ll also cover some of the key AI-related innovations by storage vendors with respect to improving storage access for GPUs, consolidating various types of storage to reduce operational costs, and building storage fabrics for moving data between distributed AI sites in the AI pipeline.
To learn more about choosing the right infrastructure for modern use cases like AI, download our e-book Hybrid infrastructure: A leader’s guide.