Before creating feature vectors, we recommend studying numerical data in two ways:
Visualize your data in plots or graphs.
Get statistics about your data.
Visualize your data
Graphs can help you find anomalies or patterns hiding in the data.
Therefore, before getting too far into analysis, look at your data graphically, either as scatter plots or histograms. View graphs not only at the beginning of the data pipeline, but also throughout data transformations. Visualizations help you continually check your assumptions.
Note that certain visualization tools are optimized for certain data formats.
A visualization tool that helps you evaluate protocol buffers may or may not be able to help you evaluate CSV data.
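As a rough illustration of looking at a feature graphically, the sketch below builds a plain-text histogram using only the Python standard library. The function name `text_histogram`, the bin width, and the sample values are all hypothetical and chosen for illustration; in practice a plotting library would normally produce the scatter plots and histograms described above.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one text line per bin, from the lowest to the highest bin.

    Assumes integer values and a positive integer bin_width.
    """
    # Map each value to the lower edge of the bin that contains it.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    edge = min(counts)
    last_edge = max(counts)
    while edge <= last_edge:
        # Empty bins are printed too, so gaps in the data stay visible.
        lines.append(f"{edge:>6} | {'#' * counts.get(edge, 0)}")
        edge += bin_width
    return lines


# Hypothetical feature values; 19 sits far away from the rest.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 19]
hist = text_histogram(values, bin_width=2)
print("\n".join(hist))
```

The run of empty bins between the main cluster and the last bar is the kind of pattern that is easy to see in a graph and easy to miss in a table of raw numbers.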
Statistically evaluate your data
Beyond visual analysis, we also recommend evaluating potential features and labels mathematically, gathering basic statistics such as:
mean and median
standard deviation
the values at the quartile divisions: the 0th, 25th, 50th, 75th, and 100th percentiles. The 0th percentile is the minimum value of this column; the 100th percentile is the maximum value of this column. (The 50th percentile is the median.)
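The statistics listed above can be gathered with Python's standard `statistics` module, as in the minimal sketch below. The helper name `summarize` and the sample column are hypothetical. The sketch assumes the sample standard deviation is wanted and uses the `inclusive` quantile method, which treats the data as containing its own minimum and maximum; other percentile conventions give slightly different quartile values.

```python
import statistics


def summarize(values):
    """Return basic statistics for one numerical column (a list of numbers)."""
    # quantiles(n=4) returns the three cut points between the four quartiles.
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "p0": min(values),    # 0th percentile is the column minimum
        "p25": q25,
        "p50": q50,           # 50th percentile is the median
        "p75": q75,
        "p100": max(values),  # 100th percentile is the column maximum
    }


# Hypothetical column of feature values.
summary = summarize([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value}")
```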
Find outliers
An outlier is a value distant from most other values in a feature or label. Outliers often cause problems in model training, so finding outliers is important.
When the delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles, the dataset probably contains outliers.
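The percentile comparison above can be sketched as follows. The function names and the `ratio` threshold of 3.0 are assumptions made for illustration; the text does not define what counts as a significant difference, so treat the threshold as a tunable heuristic rather than a rule.

```python
import statistics


def tail_deltas(values):
    """Return (p25 - p0, p100 - p75) for a list of numbers."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25 - min(values), max(values) - q75


def tails_look_lopsided(values, ratio=3.0):
    """Heuristic check: is one tail much longer than the other?

    The default ratio is a hypothetical threshold, not a standard value.
    """
    lower, upper = tail_deltas(values)
    small, large = sorted((lower, upper))
    if small == 0:
        # Avoid dividing by zero; any nonzero tail is longer than a zero tail.
        return large > 0
    return large / small > ratio


# Hypothetical column with one value far above the rest.
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 19]
balanced = [1, 2, 3, 4, 5]
print(tail_deltas(skewed), tails_look_lopsided(skewed))
print(tail_deltas(balanced), tails_look_lopsided(balanced))
```

A check like this only catches lopsided tails. Symmetric outliers on both ends, or anomalies in the middle of the range, pass it unnoticed, which is one reason to pair statistics with visual inspection.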
Outliers can fall into any of the following categories:
The outlier is due to a mistake.
For example, perhaps an experimenter mistakenly entered an extra zero, or perhaps an instrument that gathered data malfunctioned.
You'll generally delete examples containing mistake outliers.
The outlier is a legitimate data point, not a mistake.
In this case, will your trained model ultimately need to infer good predictions on these outliers?
If yes, keep these outliers in your training set. After all, outliers in certain features sometimes mirror outliers in the label, so the outliers could actually help your model make better predictions. Be careful: extreme outliers can still hurt your model.
If no, delete the outliers or apply more invasive feature engineering techniques, such as clipping.
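Clipping, mentioned in the last option above, caps values at chosen bounds instead of deleting the examples. The sketch below is a minimal version; the function name `clip` and the bounds 0 and 5 are hypothetical, and suitable bounds depend on the feature in question.

```python
def clip(values, lower, upper):
    """Cap every value to the closed range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical feature values and bounds.
raw = [-3, 2, 4, 19]
clipped = clip(raw, lower=0, upper=5)
print(clipped)
```

Unlike deletion, clipping keeps every example in the training set, but it discards how far beyond the bound the original value was.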
Last updated 2025-02-26 UTC.