    6c6f3257
    [SPARK-5089][PYSPARK][MLLIB] Fix vector convert · 6c6f3257
    freeman authored
    This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should upcast the data to float64, but this was not actually happening. As a result, non-float64 arrays were silently misparsed during SerDe, yielding erroneous results when running, for example, KMeans.
    
    The PR includes the fix, as well as a new test for the correct conversion behavior.
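
The failure mode described above can be illustrated with plain numpy. This is a minimal sketch, not the code from `linalg.py`: `to_float64` is a hypothetical helper, and the byte-level reinterpretation is only an assumption about what a SerDe layer expecting float64 would effectively do. The squashed commit title "Return array after changing type" suggests the converted array was not being kept, which is the pattern shown in the comment inside the helper.

```python
import numpy as np


def to_float64(values):
    """Hypothetical helper: return `values` as a float64 ndarray."""
    ar = np.asarray(values)
    if ar.dtype != np.float64:
        # astype() returns a new array rather than converting in place,
        # so the result has to be assigned (or returned) to take effect.
        ar = ar.astype(np.float64)
    return ar


ints = np.array([1, 2, 3], dtype=np.int64)

# Reading the raw int64 bytes as if they were float64 gives values
# that have nothing to do with the original integers.
misread = np.frombuffer(ints.tobytes(), dtype=np.float64)

# After an explicit upcast, the same byte-level round trip is safe.
converted = to_float64(ints)
roundtrip = np.frombuffer(converted.tobytes(), dtype=np.float64)

print(misread)    # tiny subnormal numbers, not 1, 2, 3
print(roundtrip)  # the original values as float64
```

Both int64 and float64 use 8 bytes per element, so the misread array has the right length and raises no error, which is why this kind of bug goes unnoticed until the results look wrong.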
    
    davies
    
    Author: freeman <the.freeman.lab@gmail.com>
    
    Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:
    
    764db47 [freeman] Add a test for proper conversion behavior
    704f97e [freeman] Return array after changing type
linalg.py 20.96 KiB