.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/curation/plot_2_train_a_model.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorials_curation_plot_2_train_a_model.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_curation_plot_2_train_a_model.py:

Training a model for automated curation
=======================================

If the pretrained models do not give satisfactory performance on your data,
it is easy to train your own classifier using SpikeInterface.

.. GENERATED FROM PYTHON SOURCE LINES 10-14

Step 1: Generate and label data
-------------------------------

First we will import our dependencies:

.. GENERATED FROM PYTHON SOURCE LINES 14-28

.. code-block:: Python

    import warnings
    warnings.filterwarnings("ignore")

    from pathlib import Path
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    import spikeinterface.core as si
    import spikeinterface.curation as sc
    import spikeinterface.widgets as sw

    # Note: you can set the number of cores to use with e.g.
    # si.set_global_job_kwargs(n_jobs=8)

.. GENERATED FROM PYTHON SOURCE LINES 29-39

For this tutorial, we will use simulated data to create ``recording`` and ``sorting`` objects.
We'll create two sorting objects: :code:`sorting_1` is coupled to the real recording, so its
spike times perfectly match the spikes in the recording; hence it will contain good units.
In contrast, :code:`sorting_2` is uncoupled from the recording, so its spike times will not
match the spikes in the recording; hence these units will mostly be random noise. We'll combine
the "good" and "noise" sortings into one sorting object using :code:`si.aggregate_units`.

(When making your own model, you should `load your own recording `_ and `do a sorting `_
on your data.)

.. GENERATED FROM PYTHON SOURCE LINES 39-45
.. code-block:: Python

    recording, sorting_1 = si.generate_ground_truth_recording(num_channels=4, seed=1, num_units=5)
    _, sorting_2 = si.generate_ground_truth_recording(num_channels=4, seed=2, num_units=5)
    both_sortings = si.aggregate_units([sorting_1, sorting_2])

.. GENERATED FROM PYTHON SOURCE LINES 46-48

To do some visualisation and postprocessing, we need to create a sorting analyzer and compute
some extensions:

.. GENERATED FROM PYTHON SOURCE LINES 48-52

.. code-block:: Python

    analyzer = si.create_sorting_analyzer(sorting=both_sortings, recording=recording)
    analyzer.compute(['noise_levels', 'random_spikes', 'waveforms', 'templates'])

Now we can plot the unit templates:

.. code-block:: Python

    sw.plot_unit_templates(analyzer)

.. GENERATED FROM PYTHON SOURCE LINES 60-64

This is as expected: great! (Find out more about plotting `using widgets `_.) We've set up
our system so that the first five units are 'good' and the next five are 'bad'. So we can
make a list of labels which contains this information. For real data, you could use a manual
curation tool to make your own list.

.. GENERATED FROM PYTHON SOURCE LINES 64-67

.. code-block:: Python

    labels = ['good', 'good', 'good', 'good', 'good', 'bad', 'bad', 'bad', 'bad', 'bad']

.. GENERATED FROM PYTHON SOURCE LINES 68-76

Step 2: Train our model
-----------------------

We'll now train a model based on our labelled data. The model will be trained using properties
of the units, and can then be applied to units from other sortings. The properties we use are
the `quality metrics `_ and `template metrics `_. Hence we need to compute these, using some
``sorting_analyzer`` extensions.

.. GENERATED FROM PYTHON SOURCE LINES 76-79

.. code-block:: Python

    analyzer.compute(['spike_locations', 'spike_amplitudes', 'correlograms', 'principal_components', 'quality_metrics', 'template_metrics'])
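To make concrete what the trainer consumes, here is a hypothetical, miniature version of the
data layout: one row of metric values per unit, paired with one label per unit (the metric
names are borrowed from the quality metrics; all values are invented):

```python
import pandas as pd

# Hypothetical miniature of the training inputs: one row of metric values per
# unit (values invented for illustration), plus one label per unit.
metrics = pd.DataFrame(
    {
        "snr": [12.1, 9.8, 1.2, 0.9],
        "nn_hit_rate": [0.95, 0.90, 0.20, 0.15],
    },
    index=[0, 1, 2, 3],  # unit ids
)
labels = ["good", "good", "bad", "bad"]  # one label per unit, in the same order

# The labels must line up one-to-one with the rows of the metrics table
assert len(labels) == len(metrics)
```

The real tables are much wider, with one column per quality or template metric computed above.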
Now that we have metrics and labels, we're ready to train the model using the ``train_model``
function. The trainer tries out combinations of classifier, imputation strategy and scaling
technique, and keeps the most accurate one:

.. code-block:: Python

    # Train the model and save it (with its metadata and accuracy scores) in `my_folder`
    trainer = sc.train_model(
        mode="analyzers",
        labels=[labels],
        analyzers=[analyzer],
        folder="my_folder",
    )

    best_model = trainer.best_pipeline

Under the hood, each candidate model is a scikit-learn pipeline. You can pass in different
`classifiers `_, `imputation strategies `_ and `scalers `_, although the documentation is
quite overwhelming. You can find the classifiers we've tried out using the
``sc.get_default_classifier_search_spaces`` function.

The above code saves the model in ``model.skops``, some metadata in ``model_info.json`` and
the model accuracies in ``model_accuracies.csv`` in the specified ``folder`` (in this case
``'my_folder'``). (``skops`` is a file format: you can think of it as a more secure pickle
file. `Read more `_.)

The ``model_accuracies.csv`` file contains the accuracy, precision and recall of the tested
models. Let's take a look:

.. GENERATED FROM PYTHON SOURCE LINES 123-127

.. code-block:: Python

    accuracies = pd.read_csv(Path("my_folder") / "model_accuracies.csv", index_col=0)
    accuracies.head()
.. rst-class:: sphx-glr-script-out

.. code-block:: none

       classifier name         imputation_strategy  scaling_strategy  balanced_accuracy  precision  recall  model_id                                        best_params
    0  RandomForestClassifier  median               StandardScaler()                1.0        1.0     1.0         0  {'n_estimators': 150, 'min_samples_split': 4, ...
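Since ``model_accuracies.csv`` is an ordinary CSV file, you can also rank the candidates
yourself with plain pandas. A minimal sketch, using an invented table (the rows and scores
below are made up, not real output):

```python
import io
import pandas as pd

# Invented stand-in for my_folder/model_accuracies.csv (scores are made up)
csv_text = """,classifier name,imputation_strategy,scaling_strategy,balanced_accuracy,precision,recall
0,RandomForestClassifier,median,StandardScaler(),0.95,0.94,0.96
1,LogisticRegression,median,StandardScaler(),0.88,0.90,0.86
"""

accuracies = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Pick the candidate with the highest balanced accuracy
best_row = accuracies.sort_values("balanced_accuracy", ascending=False).iloc[0]
print(best_row["classifier name"])  # RandomForestClassifier
```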
.. GENERATED FROM PYTHON SOURCE LINES 128-135

Our model is perfect!! This is because the task was *very* easy: we had 10 units, half of
which were pure noise and half of which were not.

The model also contains some more information, such as which features are "important", as
defined by sklearn (learn about the feature importance of a `Random Forest Classifier `_).
We can plot these:

.. GENERATED FROM PYTHON SOURCE LINES 135-159

.. code-block:: Python

    # Plot feature importances
    importances = best_model.named_steps['classifier'].feature_importances_
    indices = np.argsort(importances)[::-1]

    # The sklearn importances are not computed for inputs whose values are all `nan`.
    # Hence, we need to pick out the non-`nan` columns of our metrics
    features = best_model.feature_names_in_
    n_features = best_model.n_features_in_

    metrics = pd.concat([analyzer.get_extension('quality_metrics').get_data(), analyzer.get_extension('template_metrics').get_data()], axis=1)
    non_null_metrics = ~(metrics.isnull().all()).values

    features = features[non_null_metrics]
    n_features = len(features)

    plt.figure(figsize=(12, 7))
    plt.title("Feature Importances")
    plt.bar(range(n_features), importances[indices], align="center")
    plt.xticks(range(n_features), features[indices], rotation=90)
    plt.xlim([-1, n_features])
    plt.subplots_adjust(bottom=0.3)
    plt.show()

.. image-sg:: /tutorials/curation/images/sphx_glr_plot_2_train_a_model_002.png
   :alt: Feature Importances
   :srcset: /tutorials/curation/images/sphx_glr_plot_2_train_a_model_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 160-169

Roughly, this means the model is using metrics such as "nn_hit_rate" and "l_ratio", but is
not using "sync_spike_4" and "rp_contamination". This is a toy model, so don't take these
results seriously. But using this information, you could retrain another, simpler model using
a subset of the metrics, by passing e.g. ``metric_names = ['nn_hit_rate', 'l_ratio',...]`` to
the ``train_model`` function.
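The feature-importance machinery used above is standard scikit-learn behaviour, not something
SpikeInterface-specific. As a self-contained sketch with invented data: a random forest
trained on one column that tracks the label and one pure-noise column concentrates its
importance on the former:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented stand-in for a unit-metrics table: one informative column and one
# pure-noise column, with binary good/bad labels
rng = np.random.default_rng(0)
n_units = 200
y = rng.integers(0, 2, size=n_units)                    # 0 = "bad", 1 = "good"
informative = y + rng.normal(scale=0.3, size=n_units)   # correlates with the label
noise = rng.normal(size=n_units)                        # carries no signal
X = np.column_stack([informative, noise])

clf = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ is normalised to sum to 1; argsort in reverse gives the
# most important feature first, as in the plotting code above
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
```

Here ``indices[0]`` picks out the informative column; dropping features with near-zero
importance is exactly the kind of simplification the ``metric_names`` argument enables.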
Now that you have a model, you can `apply it to another sorting `_ or
`upload it to HuggingFaceHub `_.

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 9.604 seconds)

.. _sphx_glr_download_tutorials_curation_plot_2_train_a_model.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_2_train_a_model.ipynb <plot_2_train_a_model.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_2_train_a_model.py <plot_2_train_a_model.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_2_train_a_model.zip <plot_2_train_a_model.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_