    [SPARK-20214][ML] Make sure converted csc matrix has sorted indices · 12206058
    Liang-Chi Hsieh authored
    ## What changes were proposed in this pull request?
    
    `_convert_to_vector` converts a scipy sparse matrix to a CSC matrix in order to initialize a `SparseVector`. However, the conversion does not guarantee that the resulting CSC matrix has sorted indices, so a call like the following fails:
    
        from scipy.sparse import lil_matrix
        lil = lil_matrix((4, 1))
        lil[1, 0] = 1
        lil[3, 0] = 2
        _convert_to_vector(lil.todok())
    
        File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
          return SparseVector(l.shape[0], csc.indices, csc.data)
        File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
          % (self.indices[i], self.indices[i + 1]))
        TypeError: Indices 3 and 1 are not strictly increasing
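
The traceback above comes from a check that the indices are strictly increasing. As a rough, pure-Python sketch of that kind of check (the helper names `check_strictly_increasing` and `sort_pairs` are hypothetical and are not taken from the pyspark source), together with one way to reorder indices and values in tandem:

```python
def check_strictly_increasing(indices):
    """Raise TypeError if any index is not smaller than its successor."""
    for i in range(len(indices) - 1):
        if indices[i] >= indices[i + 1]:
            raise TypeError(
                "Indices %s and %s are not strictly increasing"
                % (indices[i], indices[i + 1])
            )


def sort_pairs(indices, values):
    """Sort indices ascending, keeping each value attached to its index."""
    pairs = sorted(zip(indices, values))
    return [i for i, _ in pairs], [v for _, v in pairs]


# Unsorted input shaped like the failing case: index 3 precedes index 1.
indices, values = sort_pairs([3, 1], [2.0, 1.0])
check_strictly_increasing(indices)  # passes once the pairs are sorted
```

The point of sorting pairs rather than the index list alone is that each stored value has to move with its index.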
    
    A simple test confirms that `dok_matrix.tocsc()` does not guarantee sorted indices:
    
        >>> from scipy.sparse import lil_matrix
        >>> lil = lil_matrix((4, 1))
        >>> lil[1, 0] = 1
        >>> lil[3, 0] = 2
        >>> dok = lil.todok()
        >>> csc = dok.tocsc()
        >>> csc.has_sorted_indices
        0
        >>> csc.indices
        array([3, 1], dtype=int32)
    
    I checked the scipy source code. The only conversions that guarantee sorted indices are `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
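
For illustration only, here is a minimal sketch of one possible way to obtain sorted indices regardless of which conversion produced the CSC matrix. It relies on scipy's `has_sorted_indices` attribute and in-place `sort_indices()` method. The helper name `sorted_indices_and_data` is hypothetical, and this is not necessarily the change made in this patch. The unsorted matrix is built directly from raw arrays because whether `dok_matrix.tocsc()` returns unsorted indices may depend on the scipy version.

```python
import numpy as np
from scipy.sparse import csc_matrix

# A 4x1 CSC matrix whose row indices are deliberately out of order,
# mirroring the [3, 1] case shown above.
data = np.array([2.0, 1.0])
indices = np.array([3, 1], dtype=np.int32)
indptr = np.array([0, 2], dtype=np.int32)
csc = csc_matrix((data, indices, indptr), shape=(4, 1))


def sorted_indices_and_data(m):
    """Return (indices, data) of m in CSC form with indices sorted."""
    m = m.tocsc()
    if not m.has_sorted_indices:
        m = m.copy()      # avoid mutating the caller's matrix
        m.sort_indices()  # reorders indices and data together, in place
    return m.indices, m.data


idx, vals = sorted_indices_and_data(csc)
```

Copying before sorting is a design choice in this sketch: it leaves the caller's matrix untouched at the cost of one extra allocation.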
    
    ## How was this patch tested?
    
    Existing tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #17532 from viirya/make-sure-sorted-indices.