    23452be9
    Xiangrui Meng authored
    [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
    
    This PR contains two major changes to `OneHotEncoder`:
    
    1. More robust handling of ML attributes. If the input attribute is unknown, we scan the values to find the max category index.
    2. Change `includeFirst` to `dropLast` and keep the default as `true`. There are a couple of benefits:
    
        a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
        b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
        c. If users use `StringIndexer`, the last category is the least frequent one.
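
A minimal sketch of the two changes above, in plain Python rather than Spark. The helper names (`one_hot`, `one_hot_drop_first`) and the `(size, active_indices)` return shape are hypothetical, chosen only for illustration; this is not the actual `OneHotEncoder` code. It shows how the number of categories can be inferred from the max index when no attribute is available, and why dropping the last category leaves indices unchanged while dropping the first shifts them by 1.

```python
def one_hot(index, num_categories, drop_last=True):
    """Sparse one-hot encoding as (vector_size, active_indices).

    With drop_last=True the last category maps to the all-zeros vector
    and every other category keeps its original index.
    """
    size = num_categories - 1 if drop_last else num_categories
    active = [index] if index < size else []
    return size, active


def one_hot_drop_first(index, num_categories):
    """Alternative for comparison: dropping the first category shifts indices by 1."""
    size = num_categories - 1
    active = [index - 1] if index > 0 else []
    return size, active


# When the input attribute is unknown, infer the category count from the data.
values = [0.0, 1.0, 2.0, 1.0]
num_categories = int(max(values)) + 1

encoded = [one_hot(int(v), num_categories) for v in values]
shifted = [one_hot_drop_first(int(v), num_categories) for v in values]
```

Here category 1 stays at position 1 in `encoded` but moves to position 0 in `shifted`, which is the index shift that benefit (b) avoids.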
    
    Sorry for including two changes in one PR! I'll update the user guide in another PR.
    
    jkbradley sryza
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:
    
    a280dca [Xiangrui Meng] fix tests
    d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
    171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
    00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
    208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
feature.py 37.56 KiB