Advice

Machine learning

"If you're researching a new idea, always overfit a single batch before going to bigger tests"
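
A minimal sketch of what "overfitting a single batch" can look like, assuming a toy linear model trained with plain gradient descent (the function name `overfit_single_batch`, the data, and the hyperparameters are all made up for illustration; a real experiment would use your actual model and framework). The idea is to train repeatedly on one fixed batch and check that the loss goes to nearly zero. If it does not, the bug is probably in the model or the training loop, not in the idea.

```python
def overfit_single_batch(batch, lr=0.1, steps=500):
    """Fit y = w*x + b to one fixed batch; return the MSE loss at every step."""
    w, b = 0.0, 0.0
    n = len(batch)
    losses = []
    for _ in range(steps):
        # Forward pass on the same batch every time (that is the whole point).
        errors = [(w * x + b) - y for x, y in batch]
        losses.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return losses


# Hypothetical batch generated from y = 2x + 1, so a perfect fit exists.
batch = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
losses = overfit_single_batch(batch)
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3e}")
```
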
Start with the code and pipeline for inference, not for training.

In more detail:
- First, you write the inference code using a really simple baseline model and the first couple of features that come to mind
- Your code stores the features in a feature store (e.g. an SQL table in the simplest case), and it also stores the model's predictions
- Your pipeline can be run on any subset of the data. For example, if it is a time series, you can pass any start and end timestamp to your pipeline, so that it is not always working on "the last day of data" or similar. This is needed for training, but also later in production if the pipeline has failed and/or you need to re-create predictions for some arbitrary past data
- Your pipeline also calculates your metrics (e.g. accuracy) by comparing the predictions with the labels. This is needed for training, but it is also important in production, where you will monitor your pipeline's accuracy.
- Second, you train your real model on features taken from the feature store. This is cool, because they are calculated in exactly the same way, by exactly the same code, as they will be later in production for inference.
- You let the model predict the test set and use your pipeline to evaluate the accuracy. Same advantage here: same metrics, same code as later in production.
- If you now need to add features, you just implement the new ones and go back to the first step (run the baseline). If you need to change existing features, you never do that: they are immutable. Instead you create a new feature with a new name (just add "v.2" at the end if you can't come up with a more descriptive name). And then you always go back to the first step.
It helps if the code calculating features checks whether a feature is already in the feature store and skips the calculation if it is.
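
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the feature store is an in-memory SQLite database, and the table layout, the `FEATURES` registry, the `baseline_model`, and every function name here are hypothetical, not a prescribed design. It shows a pipeline that runs on an arbitrary timestamp range, stores features and predictions, skips features that are already stored, and computes accuracy with the same code that would later run in production.

```python
import sqlite3


def make_store():
    """Create a toy feature store; a real one would be a persistent database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE features (ts INTEGER, name TEXT, value REAL, "
                 "PRIMARY KEY (ts, name))")
    conn.execute("CREATE TABLE predictions (ts INTEGER, model TEXT, value INTEGER, "
                 "PRIMARY KEY (ts, model))")
    return conn


# Hypothetical feature registry: name -> function of a raw row.
# Features are immutable: a changed feature gets a new name, e.g. "x_squared_v2".
FEATURES = {
    "x_raw": lambda row: row["x"],
    "x_squared": lambda row: row["x"] ** 2,
}


def compute_features(conn, raw, start, end):
    """Store features for start <= ts < end; return how many values were written."""
    written = 0
    for row in raw:
        if not (start <= row["ts"] < end):
            continue
        for name, fn in FEATURES.items():
            exists = conn.execute(
                "SELECT 1 FROM features WHERE ts = ? AND name = ?",
                (row["ts"], name)).fetchone()
            if exists:
                continue  # already in the feature store, skip the calculation
            conn.execute("INSERT INTO features VALUES (?, ?, ?)",
                         (row["ts"], name, fn(row)))
            written += 1
    return written


def baseline_model(feats):
    """Really simple baseline: predict 1 when x_raw is positive."""
    return 1 if feats["x_raw"] > 0 else 0


def predict(conn, model_name, model, start, end):
    """Read features from the store and save one prediction per timestamp."""
    rows = conn.execute(
        "SELECT ts, name, value FROM features WHERE ts >= ? AND ts < ?",
        (start, end)).fetchall()
    by_ts = {}
    for ts, name, value in rows:
        by_ts.setdefault(ts, {})[name] = value
    for ts, feats in by_ts.items():
        conn.execute("INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)",
                     (ts, model_name, model(feats)))


def accuracy(conn, model_name, labels, start, end):
    """Compare stored predictions with labels; None if there are no predictions."""
    rows = conn.execute(
        "SELECT ts, value FROM predictions WHERE model = ? AND ts >= ? AND ts < ?",
        (model_name, start, end)).fetchall()
    if not rows:
        return None
    return sum(1 for ts, value in rows if labels[ts] == value) / len(rows)


def run_pipeline(conn, raw, labels, model_name, model, start, end):
    """Features -> predictions -> metric, for any timestamp range."""
    compute_features(conn, raw, start, end)
    predict(conn, model_name, model, start, end)
    return accuracy(conn, model_name, labels, start, end)


# Made-up data: four timestamps with one raw value each, plus labels.
raw = [{"ts": t, "x": x} for t, x in enumerate([-2.0, -1.0, 0.5, 3.0])]
labels = {0: 0, 1: 0, 2: 1, 3: 0}
conn = make_store()
print(run_pipeline(conn, raw, labels, "baseline", baseline_model, 0, 4))
```

Training the real model would then read from the `features` table instead of recomputing anything, and the trained model would be passed to `run_pipeline` in place of `baseline_model`, so evaluation reuses the same metric code.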