Production ML systems: When to transform data?
Raw data must be feature engineered (transformed). When should you transform data? Broadly speaking, you can perform feature engineering during either of the following two periods:

- Before training the model.
- While training the model.
Transforming data before training

In this approach, you follow two steps:
- Write code or use specialized tools to transform the raw data.
- Store the transformed data somewhere the model can ingest it, such as on disk, as sketched below.
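For example, here is a minimal sketch of the transform-then-store approach. The file names and the `price` column are placeholders for illustration, not part of the original page:

```python
import pandas as pd

# Load the raw data (hypothetical file and column names).
raw = pd.read_csv("raw_data.csv")

# Analyze the *entire* dataset to pick the transformation parameters.
price_mean = raw["price"].mean()
price_std = raw["price"].std()

# Apply the transformation once, up front.
transformed = raw.copy()
transformed["price_zscore"] = (raw["price"] - price_mean) / price_std

# Store the transformed data where the training job can ingest it.
transformed.to_csv("transformed_data.csv", index=False)

# Persist the parameters too; you need the exact same mean and standard
# deviation to transform raw examples at prediction time.
pd.DataFrame({"price_mean": [price_mean], "price_std": [price_std]}).to_csv(
    "transform_params.csv", index=False
)
```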
Advantages

- The system transforms the raw data only once.
- The system can analyze the entire dataset to determine the best transformation strategy.

Disadvantages

- You must recreate the transformations at prediction time. Beware of training-serving skew!
Training-serving skew is more dangerous when your system performs dynamic (online) inference. In a system that uses dynamic inference, the software that transforms the raw dataset usually differs from the software that serves predictions, which can cause training-serving skew. In contrast, systems that use static (offline) inference can sometimes use the same software.
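One way to limit that risk, sketched below under the assumption that both the offline pipeline and the prediction server are Python code that can import a shared module, is to define the transformation exactly once. The module and function names here are hypothetical:

```python
# shared_transforms.py -- imported by BOTH the offline training pipeline
# and the online prediction service, so the two cannot drift apart.

def zscore(value: float, mean: float, std: float) -> float:
    """Applies exactly the same scaling at training and at serving time."""
    return (value - mean) / std


# Offline: transform the whole training set with precomputed statistics.
def transform_training_rows(rows, mean, std):
    return [zscore(v, mean, std) for v in rows]


# Online: transform a single incoming request with the same function
# and the same stored statistics.
def transform_request(value, mean, std):
    return zscore(value, mean, std)
```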
Transforming data while training

In this approach, the transformation is part of the model code. The model ingests raw data and transforms it.
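As an illustration of what an in-model transformation can look like, here is a minimal sketch using the Keras `Normalization` preprocessing layer; the feature values and model architecture are made up for the example:

```python
import numpy as np
import tensorflow as tf

# Hypothetical raw numeric feature used to fit the normalization statistics.
raw_training_feature = np.array([[150.0], [200.0], [320.0], [90.0]], dtype="float32")

# The normalization layer lives inside the model, so the same transform
# runs at training time and at prediction time.
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(raw_training_feature)  # learns mean and variance

model = tf.keras.Sequential([
    normalizer,                          # transform raw input first
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

# The model ingests *raw* values; no separate preprocessing step is needed
# at prediction time.
prediction = model(np.array([[175.0]], dtype="float32"))
```

Because the normalization layer is saved as part of the model, exporting the model also exports the transformation, which is what guarantees identical behavior at training and prediction time.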
Advantages

- You can keep using the same raw data files even if you change the transformations.
- You're guaranteed the same transformations at training and prediction time.

Disadvantages

- Complicated transforms can increase model latency.
- The transformation runs on each and every batch.
Transforming the data per batch can be tricky. For example, suppose you want to use Z-score normalization to transform raw numerical data. Z-score normalization requires the feature's mean and standard deviation. However, per-batch transformation means you only have access to one batch of data, not the full dataset. So, if the batches are highly variant, a Z-score of, say, -2.5 in one batch doesn't mean the same thing as a Z-score of -2.5 in another batch. As a workaround, your system can precompute the mean and standard deviation across the entire dataset and then use them as constants in the model.
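A sketch of that workaround, again using Keras for illustration: compute the statistics once over the full dataset and pass them to the in-model normalization as constants, rather than estimating them per batch. The feature values below are placeholders:

```python
import numpy as np
import tensorflow as tf

# Precompute the statistics over the ENTIRE dataset (hypothetical values).
full_dataset_feature = np.array([[12.0], [15.0], [9.0], [22.0], [18.0]], dtype="float32")
feature_mean = full_dataset_feature.mean()
feature_variance = full_dataset_feature.var()

# Pass the precomputed statistics in as constants; no adapt() call,
# so per-batch variation cannot change what a given Z-score means.
normalizer = tf.keras.layers.Normalization(
    axis=-1, mean=feature_mean, variance=feature_variance
)

model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(1),
])

prediction = model(np.array([[14.0]], dtype="float32"))
```

Because the mean and variance are fixed constants, a Z-score of -2.5 now means the same thing regardless of which batch an example arrives in.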