    [SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k · 9ac05225
    wm624@hotmail.com authored
    ## What changes were proposed in this pull request?
    
    When KMeans is run with initMode = "random" and certain random seeds, the actual number of clusters may not equal the configured `k`.
    
    In this case, summary(model) returns an error because the number of columns of the coefficient matrix does not equal `k`.
    
    Example:
    > col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > cols <- as.data.frame(cbind(col1, col2, col3))
    > df <- createDataFrame(cols)
    >
    > model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
    >
    > summary(model2)
    Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
      length of 'dimnames' [2] not equal to array extent
    In addition: Warning message:
    In matrix(coefficients, ncol = k) :
      data length [9] is not a sub-multiple or multiple of the number of rows [2]
    
    Fix: Get the actual number of clusters in the summary and use it, instead of `k`, to build the coefficient matrix.
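
The failure and the fix can be reproduced in base R without Spark. The following is a minimal sketch, not the patched SparkR code: `cluster_size`, the placeholder values in `coefficients`, and the assumption that the centers arrive flattened one cluster after another are all hypothetical stand-ins chosen to match the numbers in the error message above (9 values, 3 features).

```r
# Hypothetical stand-ins for values a fitted model would supply.
features <- c("col1", "col2", "col3")
k <- 5                           # configured number of clusters
cluster_size <- 3                # clusters actually produced (assumed)
coefficients <- as.numeric(1:9)  # placeholder centers, flattened cluster by cluster

# Shaping with the configured k yields a mis-sized matrix (R warns about the
# data length), so assigning three feature names to its columns fails.
bad <- suppressWarnings(t(matrix(coefficients, ncol = k)))
failed <- inherits(try(colnames(bad) <- features, silent = TRUE), "try-error")

# Shaping with the actual cluster count gives one row per cluster center
# and one column per feature, so the names fit.
centers <- t(matrix(coefficients, ncol = cluster_size))
colnames(centers) <- features
rownames(centers) <- seq_len(cluster_size)
```

Here `failed` records that the `k`-based shaping raises an error, while `centers` ends up with one named column per feature and one row per cluster.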
    ## How was this patch tested?
    
    Add unit tests.
    
    Author: wm624@hotmail.com <wm624@hotmail.com>
    
    Closes #16666 from wangmiao1981/kmeans.