AWS – Refit transforms to prepare data at scale with Amazon SageMaker Data Wrangler
Today, we are excited to announce support for refitting transforms in Amazon SageMaker Data Wrangler. To make data usable by algorithms such as XGBoost, data scientists must convert non-numeric values to numeric values using transforms such as one-hot encoding. Because the output of a transform like one-hot encoding depends on the data it was computed from, such transforms are often called fitted transforms. As data changes over time, these transforms must be updated, or refitted, to reflect the new data. Likewise, a transform fitted on a sample dataset must be refitted to account for differences between the sample and the larger dataset. Fitted transforms such as one-hot encoding also generate additional information, which must be tracked and captured in the data preparation pipeline; omitting or incorrectly tracking this information can lead to errors in the data preparation process. Until now, many data scientists had no easy way to specify whether to reuse a fitted version of a transform or to refit it on new data. They also lacked an easy way to generate updated versions of their transformation pipelines when refitting on new datasets.
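To illustrate why a fitted transform can go stale, here is a minimal sketch in plain Python. It is not the Data Wrangler API: the OneHotEncoder class, its fit and transform methods, and the sample values are all hypothetical and exist only to show the difference between reusing a fit from a sample and refitting on the larger dataset.

```python
class OneHotEncoder:
    """Hypothetical minimal one-hot encoder (illustration only, not Data Wrangler)."""

    def __init__(self):
        # Learned state: the category list is the "additional information"
        # a fitted transform carries along with it.
        self.categories = None

    def fit(self, values):
        # Learn the sorted set of distinct categories from the given data.
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        if self.categories is None:
            raise ValueError("encoder has not been fitted")
        # One column per learned category; unseen values encode as all zeros.
        return [[1 if v == c else 0 for c in self.categories] for v in values]


sample = ["red", "green", "red"]          # small sample used during exploration
full = ["red", "green", "blue", "red"]    # larger dataset with a new category

encoder = OneHotEncoder().fit(sample)
stale = encoder.transform(full)   # reuses the fit learned from the sample

encoder.fit(full)                 # refit on the larger dataset
fresh = encoder.transform(full)   # now includes a column for "blue"
```

In this sketch the encoder fitted on the sample has no column for "blue", so that row is encoded as all zeros. After refitting on the full data, the encoder has three columns and "blue" is represented.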