Step 1: Gather Data
Gathering data is the most important step in solving any supervised machine learning problem. Your text classifier can only be as good as the dataset it is built from.
If you don't have a specific problem you want to solve and are just interested in exploring text classification in general, there are plenty of open-source datasets available. You can find links to some of them in our GitHub repo. On the other hand, if you are tackling a specific problem, you will need to collect the necessary data. Many organizations provide public APIs for accessing their data, for example, the X API or the NY Times API. You may be able to leverage these APIs for the problem you are trying to solve.
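As a minimal sketch of pulling training text from a public API, the snippet below pages through a keyword search. The endpoint, response structure, and rate limit shown are assumptions for illustration; check the provider's API documentation before relying on them.

```python
import time

import requests  # third-party: pip install requests

# Assumed endpoint and response shape (verify against the provider's docs).
SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
API_KEY = "YOUR_API_KEY"  # placeholder; obtain your own key

articles = []
for page in range(5):
    resp = requests.get(
        SEARCH_URL,
        params={"q": "movie review", "page": page, "api-key": API_KEY},
    )
    resp.raise_for_status()  # fail loudly on HTTP errors
    articles.extend(resp.json()["response"]["docs"])
    time.sleep(12)  # pause to stay under an assumed 5-requests-per-minute limit
```

The `time.sleep` call is the simplest way to respect a per-minute quota; for larger collection jobs you would typically back off on HTTP 429 responses instead.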
Here are some important things to remember when collecting data:
- If you are using a public API, understand its limitations before using it. For example, some APIs set a limit on the rate at which you can make queries.
- The more training examples (referred to as samples in the rest of this guide) you have, the better. This will help your model generalize better.
- Make sure the number of samples for every class or topic is not overly imbalanced; that is, you should have a comparable number of samples in each class. A quick way to check this is sketched after this list.
- Make sure that your samples adequately cover the space of possible inputs, not only the common cases.
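The following sketch counts samples per class so you can spot imbalance early. The function name and the 2x threshold are illustrative choices, not part of the guide; pick a threshold that makes sense for your problem.

```python
from collections import Counter

def check_class_balance(labels):
    """Prints per-class sample counts and flags obvious imbalance."""
    counts = Counter(labels)
    for cls, n in sorted(counts.items()):
        print(f"class {cls}: {n} samples")
    smallest, largest = min(counts.values()), max(counts.values())
    # Heuristic: warn if the largest class has over twice the samples
    # of the smallest one.
    if largest > 2 * smallest:
        print("Warning: classes look imbalanced; consider gathering more "
              "data for the smaller classes or re-sampling.")
```

For example, `check_class_balance(["pos", "neg", "pos", "pos"])` would report 3 positive versus 1 negative sample and print the warning.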
Throughout this guide, we will use the Internet Movie Database (IMDb) movie reviews dataset to illustrate the workflow. This dataset contains movie reviews posted by people on the IMDb website, along with the corresponding labels ("positive" or "negative") indicating whether the reviewer liked the movie. This is a classic example of a sentiment analysis problem.
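Below is a minimal sketch of loading this dataset, assuming you have downloaded aclImdb_v1.tar.gz from the dataset's page and extracted it into an `aclImdb/` directory, which contains `train/` and `test/` splits with `pos/` and `neg/` subfolders of one review per `.txt` file.

```python
import os

def load_imdb_split(data_dir, split):
    """Reads one split ('train' or 'test') of the extracted aclImdb/ folder.

    Returns (texts, labels), where label 0 = negative and 1 = positive.
    """
    texts, labels = [], []
    for label, category in enumerate(("neg", "pos")):
        folder = os.path.join(data_dir, split, category)
        for fname in sorted(os.listdir(folder)):
            if fname.endswith(".txt"):
                with open(os.path.join(folder, fname), encoding="utf-8") as f:
                    texts.append(f.read())
                labels.append(label)
    return texts, labels

train_texts, train_labels = load_imdb_split("aclImdb", "train")
test_texts, test_labels = load_imdb_split("aclImdb", "test")
```

With the data loaded as parallel lists of texts and labels, you can run a balance check like the one sketched earlier before moving on to the next step.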