  1. Feb 22, 2017
    • Zheng RuiFeng's avatar
      [SPARK-19679][ML] Destroy broadcasted object without blocking · bf7bb497
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Destroy broadcasted object without blocking
      use `find mllib -name '*.scala' | xargs -i bash -c 'egrep "destroy" -n {} && echo {}'`
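
      A minimal sketch of the blocking vs. non-blocking cleanup involved (illustrative user-level code, not the patch itself; to my understanding the non-blocking `destroy(blocking = false)` overload is package-private to Spark, so mllib code can call it while user code uses `unpersist(blocking = false)` or the blocking `destroy()`):
      ```scala
      import org.apache.spark.sql.SparkSession

      object BroadcastCleanupSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("broadcast-cleanup").getOrCreate()
          val sc = spark.sparkContext

          val bc = sc.broadcast(Array.tabulate(1000)(_.toDouble))
          val total = sc.parallelize(1 to 4).map(_ => bc.value.sum).reduce(_ + _)
          println(s"total = $total")

          // Non-blocking: asynchronously drop cached copies once the value is no longer needed.
          bc.unpersist(blocking = false)
          // Blocking: the public destroy() waits until the broadcast is fully removed.
          // Inside mllib, this change switches such cleanup to the non-blocking variant.
          bc.destroy()

          spark.stop()
        }
      }
      ```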
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17016 from zhengruifeng/destroy_without_block.
      bf7bb497
    • Zheng RuiFeng's avatar
      [SPARK-19694][ML] Add missing 'setTopicDistributionCol' for LDAModel · ef3c7353
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add missing 'setTopicDistributionCol' for LDAModel
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17021 from zhengruifeng/lda_outputCol.
      ef3c7353
  2. Feb 19, 2017
  3. Feb 18, 2017
  4. Feb 15, 2017
    • Yun Ni's avatar
      [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing · 08c1972a
      Yun Ni authored
      ## What changes were proposed in this pull request?
      This pull request includes the Python API and examples for LSH. The API changes were based on yanboliang 's PR #15768, with conflicts and API changes against the Scala API resolved. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
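
      For reference, a minimal Scala sketch of the MinHashLSH workflow that the new Python API and examples mirror (toy data; spark.ml LSH API):
      ```scala
      import org.apache.spark.ml.feature.MinHashLSH
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.sql.SparkSession

      object MinHashLSHSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("minhash-lsh").getOrCreate()
          import spark.implicits._

          val df = Seq(
            (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
            (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
            (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
          ).toDF("id", "features")

          val mh = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("hashes")
          val model = mh.fit(df)
          model.transform(df).show(false)

          // Approximate nearest-neighbor search against a query vector.
          val key = Vectors.sparse(6, Seq((0, 1.0), (2, 1.0)))
          model.approxNearestNeighbors(df, key, 2).show(false)

          spark.stop()
        }
      }
      ```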
      
      ## How was this patch tested?
      API and examples are tested using spark-submit:
      `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
      `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
      
      User guide changes are generated and manually inspected:
      `SKIP_API=1 jekyll build`
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #16715 from Yunni/spark-18080.
      08c1972a
    • wm624@hotmail.com's avatar
      [SPARK-19456][SPARKR] Add LinearSVC R API · 3973403d
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      The linear SVM classifier was newly added to ML and a Python API has been added. This JIRA is to add the R-side API.
      
      Marked as WIP, as I am designing unit tests.
      
      ## How was this patch tested?
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16800 from wangmiao1981/svc.
      3973403d
  5. Feb 14, 2017
    • sureshthalamati's avatar
      [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the... · f48c5a57
      sureshthalamati authored
      [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner.
      
      ## What changes were proposed in this pull request?
      The reason for the test failure is that the property `oracle.jdbc.mapDateToTimestamp` set by the test was getting converted to all lower case. The Oracle database expects this property in a case-sensitive manner.
      
      This test was passing in previous releases because connection properties were sent as the user specified them for this test scenario. A fix to handle all options uniformly in a case-insensitive manner also converted the JDBC connection properties to lower case.
      
      This PR enhances CaseInsensitiveMap to keep track of the input case-sensitive keys, and uses those when creating connection properties that are passed to the JDBC connection.
      
      Alternative approach PR https://github.com/apache/spark/pull/16847  is to pass original input keys to JDBC data source by adding check in the  Data source class and handle case-insensitivity in the JDBC source code.
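
      A rough sketch of the idea using a hypothetical helper (this is not Spark's actual `CaseInsensitiveMap`): lookups match case-insensitively, while the original, case-sensitive keys are kept for the connection properties handed to the JDBC driver.
      ```scala
      import java.util.Properties

      // Hypothetical illustration; Spark's CaseInsensitiveMap differs in detail.
      class CasePreservingOptions(original: Map[String, String]) {
        private val lowered = original.map { case (k, v) => k.toLowerCase -> v }

        // Case-insensitive lookup: get("URL") finds an option that was passed as "url".
        def get(key: String): Option[String] = lowered.get(key.toLowerCase)

        // Original, case-sensitive entries, e.g. "oracle.jdbc.mapDateToTimestamp",
        // suitable for building the Properties object passed to the JDBC driver.
        def asConnectionProperties: Properties = {
          val props = new Properties()
          original.foreach { case (k, v) => props.setProperty(k, v) }
          props
        }
      }
      ```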
      
      ## How was this patch tested?
      Added new test cases to JdbcSuite and OracleIntegrationSuite. Ran the Docker integration tests on my laptop; all tests passed successfully.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
      f48c5a57
  6. Feb 10, 2017
    • sueann's avatar
      [SPARK-18613][ML] make spark.mllib LDA dependencies in spark.ml LDA private · 3a43ae7c
      sueann authored
      ## What changes were proposed in this pull request?
      spark.ml.*LDAModel classes were exposing spark.mllib LDA models via protected methods. Made them package (clustering) private.
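
      For readers less familiar with Scala's scoped access modifiers, a small illustrative example of what "package (clustering) private" means (not the actual Spark source):
      ```scala
      package org.example.ml.clustering

      // A member marked private[clustering] is visible to code in the clustering package,
      // but it no longer appears in the public or protected API surface.
      class WrappedLDAModel(private[clustering] val oldLocalModel: String) {
        def describe(): String = s"wraps: $oldLocalModel" // the public API stays free of spark.mllib types
      }
      ```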
      
      ## How was this patch tested?
      ```
      build/sbt doc  # "mllib.clustering" no longer appears in the docs for *LDA* classes
      build/sbt compile  # compiles
      build/sbt
      > mllib/testOnly   # tests pass
      ```
      
      Author: sueann <sueann@databricks.com>
      
      Closes #16860 from sueann/SPARK-18613.
      3a43ae7c
  7. Feb 08, 2017
    • actuaryzhang's avatar
      [SPARK-19400][ML] Allow GLM to handle intercept only model · 1aeb9f6c
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Intercept-only GLM is failing for non-Gaussian families because of reducing an empty array in IWLS. The following code `val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y)) }` fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes this issue.
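
      A small self-contained sketch of the failure mode and one possible guard (the patch's actual fix may differ in detail):
      ```scala
      object EmptyReduceSketch {
        def main(args: Array[String]): Unit = {
          val oldCoefficients = Array.empty[Double] // intercept-only model: no coefficients

          // This is what fails: reduce on an empty collection throws UnsupportedOperationException.
          // oldCoefficients.reduce { (x, y) => math.max(math.abs(x), math.abs(y)) }

          // One safe alternative: treat the empty case explicitly.
          val maxTolOfCoefficients =
            if (oldCoefficients.isEmpty) 0.0 else oldCoefficients.map(math.abs).max
          println(maxTolOfCoefficients) // 0.0
        }
      }
      ```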
      
      yanboliang srowen imatiach-msft zhengruifeng
      
      ## How was this patch tested?
      New test for intercept only model.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16740 from actuaryzhang/interceptOnly.
      1aeb9f6c
  8. Feb 07, 2017
    • gatorsmile's avatar
      [SPARK-19397][SQL] Make option names of LIBSVM and TEXT case insensitive · e33aaa2a
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Prior to Spark 2.1, the option names were case sensitive for all formats. Since Spark 2.1, option key names have become case insensitive except for the formats `Text` and `LibSVM`. This PR is to fix these issues.
      
      Also, add a check to verify whether the input option vector type is legal for `LibSVM`.
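
      A hedged usage sketch of what case-insensitive option handling means for the LibSVM reader (the path is a placeholder; `numFeatures` and `vectorType` are documented LibSVM options):
      ```scala
      import org.apache.spark.sql.SparkSession

      object LibSVMOptionsSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("libsvm-options").getOrCreate()
          val path = "data/mllib/sample_libsvm_data.txt" // placeholder path to a LibSVM file

          // With case-insensitive option names, these two reads should behave identically.
          val a = spark.read.format("libsvm").option("numFeatures", "780").load(path)
          val b = spark.read.format("libsvm").option("NUMFEATURES", "780").load(path)

          // The new check rejects an illegal vector type; "sparse" and "dense" are the valid values.
          val c = spark.read.format("libsvm").option("vectorType", "dense").load(path)

          println(a.count() == b.count() && c.count() == a.count())
          spark.stop()
        }
      }
      ```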
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16737 from gatorsmile/libSVMTextOptions.
      e33aaa2a
  9. Feb 05, 2017
    • gatorsmile's avatar
      [SPARK-19279][SQL] Infer Schema for Hive Serde Tables and Block Creating a... · 65b10ffb
      gatorsmile authored
      [SPARK-19279][SQL] Infer Schema for Hive Serde Tables and Block Creating a Hive Table With an Empty Schema
      
      ### What changes were proposed in this pull request?
      So far, we allow users to create a table with an empty schema: `CREATE TABLE tab1`. Allowing this could break many code paths. Thus, we should follow Hive and block it.
      
      For Hive serde tables, some serde libraries require the specified schema and record it in the metastore. To get the list, we need to check `hive.serdes.using.metastore.for.schema`, which contains a list of serdes that require a user-specified schema. The default values are
      
      - org.apache.hadoop.hive.ql.io.orc.OrcSerde
      - org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      - org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
      - org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
      - org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
      - org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
      - org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      - org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
      
      ### How was this patch tested?
      Added test cases for both Hive and data source tables
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16636 from gatorsmile/fixEmptyTableSchema.
      65b10ffb
    • Asher Krim's avatar
      [SPARK-19247][ML] Save large word2vec models · b3e89802
      Asher Krim authored
      ## What changes were proposed in this pull request?
      
      * save word2vec models as distributed files rather than as one large datum. Backwards compatibility with the previous save format is maintained by checking for the "wordIndex" column
      * migrate the fix for loading large models (SPARK-11994) to ml word2vec
      
      ## How was this patch tested?
      
      Tested loading the new and old formats locally
      
      srowen yanboliang MLnick
      
      Author: Asher Krim <akrim@hubspot.com>
      
      Closes #16607 from Krimit/saveLargeModels.
      b3e89802
  10. Feb 02, 2017
  11. Feb 01, 2017
    • hyukjinkwon's avatar
      [SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in... · f1a1f260
      hyukjinkwon authored
      [SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in Scala/Java APIs generation
      
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
      - Support LaTex inline-formula, `\( ... \)` in Scala API documentation
        It seems currently,
      
        ```
        \( ... \)
        ```
      
        are rendered as they are, for example,
      
        <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">
      
        It seems extra backslashes were mistakenly added.
      
      - Fix warnings Scaladoc/Javadoc generation
        This PR fixes two types of warnings as below:
      
        ```
        [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
        [warn]   /**
        [warn]   ^
        ```
      
        ```
        [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
        [warn]  * `${var}`, `${system:var}` and `${env:var}`.
        [warn]      ^
        ```
      
      - Fix Javadoc8 break
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
        [error]    *                          E.g., {link VectorUDT} for vector features.
        [error]                                            ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
        [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
        [error]                                                  ^
        ```
      
      ## How was this patch tested?
      
      Manually via `sbt unidoc` and `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16741 from HyukjinKwon/warn-and-break.
      f1a1f260
  12. Jan 31, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k · 9ac05225
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      When KMeans uses initMode = "random" with some random seeds, it is possible that the actual number of clusters doesn't equal the configured `k`.
      
      In this case, summary(model) returns an error because the number of columns of the coefficient matrix doesn't equal k.
      
      Example:
      >  col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
      >   col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
      >   col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
      >   cols <- as.data.frame(cbind(col1, col2, col3))
      >   df <- createDataFrame(cols)
      >
      >   model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,  initMode = "random", seed = 22222, tol = 1E-5)
      >
      > summary(model2)
      Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
        length of 'dimnames' [2] not equal to array extent
      In addition: Warning message:
      In matrix(coefficients, ncol = k) :
        data length [9] is not a sub-multiple or multiple of the number of rows [2]
      
      Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.
      ## How was this patch tested?
      
      Add unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16666 from wangmiao1981/kmeans.
      9ac05225
    • Bryan Cutler's avatar
      [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to... · 57d70d26
      Bryan Cutler authored
      [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
      
      ## What changes were proposed in this pull request?
      
      Adding a convenience function to the Python `JavaWrapper` so that it is easy to create a Py4J JavaArray compatible with class constructors that take a Scala `Array` as input, making a separate Java/Python-friendly constructor unnecessary. The function takes a Java class as input, which Py4J uses to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this, and the alternate constructor is removed.
      
      ## How was this patch tested?
      
      Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
      57d70d26
  13. Jan 28, 2017
  14. Jan 27, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19336][ML][PYSPARK] LinearSVC Python API · bb1a1fe0
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add Python API for the newly added LinearSVC algorithm.
      
      ## How was this patch tested?
      
      Add new doc string test.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16694 from wangmiao1981/ser.
      bb1a1fe0
    • actuaryzhang's avatar
      [SPARK-18929][ML] Add Tweedie distribution in GLM · 4172ff80
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      I propose to add the full Tweedie family to the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as the Gaussian, Poisson, and Gamma families are special cases of the Tweedie family (https://en.wikipedia.org/wiki/Tweedie_distribution).
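
      A minimal sketch of how a Tweedie GLM would be specified with this change, assuming the `variancePower` and `linkPower` params of the final API (toy data for illustration):
      ```scala
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.ml.regression.GeneralizedLinearRegression
      import org.apache.spark.sql.SparkSession

      object TweedieGlmSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("tweedie-glm").getOrCreate()
          import spark.implicits._

          val data = Seq(
            (1.0, Vectors.dense(1.0, 0.0)),
            (0.5, Vectors.dense(0.0, 1.0)),
            (2.0, Vectors.dense(1.0, 2.0)),
            (4.0, Vectors.dense(2.0, 1.0))
          ).toDF("label", "features")

          val glr = new GeneralizedLinearRegression()
            .setFamily("tweedie")    // power variance family
            .setVariancePower(1.5)   // 1 < p < 2: compound Poisson-Gamma
            .setLinkPower(0.0)       // log link
          val model = glr.fit(data)
          println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")

          spark.stop()
        }
      }
      ```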
      
      yanboliang srowen sethah
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      Author: Wayne Zhang <actuaryzhang10@gmail.com>
      
      Closes #16344 from actuaryzhang/tweedie.
      4172ff80
  15. Jan 26, 2017
    • wm624@hotmail.com's avatar
      [SPARK-18821][SPARKR] Bisecting k-means wrapper in SparkR · c0ba2843
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add R wrapper for bisecting Kmeans.
      
      As JIRA is down, I will update title to link with corresponding JIRA later.
      
      ## How was this patch tested?
      
      Add new unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16566 from wangmiao1981/bk.
      c0ba2843
    • WeichenXu's avatar
      [SPARK-18218][ML][MLLIB] Reduce shuffled data size of BlockMatrix... · 1191fe26
      WeichenXu authored
      [SPARK-18218][ML][MLLIB] Reduce shuffled data size of BlockMatrix multiplication and solve potential OOM and low-parallelism problems by splitting the middle dimension in matrix multiplication
      
      ## What changes were proposed in this pull request?
      
      ### The problem in current block matrix multiplication
      
      As described in JIRA https://issues.apache.org/jira/browse/SPARK-18218, block matrix multiplication in Spark may cause some problems. Suppose we multiply an `M*N` matrix A by an `N*P` matrix B; when N is much larger than M and P, the following problems may occur:
      - when the middle dimension N is too large, it will cause reducer OOM.
      - even if OOM does not occur, parallelism will still be too low.
      - when N is much larger than M and P, and matrices A and B have many partitions, there may be too many partitions on the M and P dimensions, which causes a much larger shuffled data size. (I will explain this in detail below.)
      
      ### Key point of my improvement
      
      In this PR, I introduce `midDimSplitNum` parameter, and improve the algorithm, to resolve this problem.
      
      In order to understand the improvement in this PR, first let me give a simple case to explain how the current multiplication works and what causes the problems above:
      
      suppose we have block matrix A, contains 200 blocks (`2 numRowBlocks * 100 numColBlocks`), blocks arranged in 2 rows, 100 cols:
      ```
      A00 A01 A02 ... A0,99
      A10 A11 A12 ... A1,99
      ```
      and we have block matrix B, also contains 200 blocks (`100 numRowBlocks * 2 numColBlocks`), blocks arranged in 100 rows, 2 cols:
      ```
      B00    B01
      B10    B11
      B20    B21
      ...
      B99,0  B99,1
      ```
      Suppose all blocks in the two matrices are dense for now.
      Now we call A.multiply(B). Suppose the generated `resultPartitioner` contains 2 rowPartitions and 2 colPartitions (there can't be more partitions because the result matrix only contains `2 * 2` blocks); the current algorithm contains two shuffle steps:
      
      **step-1**
      Step-1 will generate 4 reducers, which I tag as reducer-00, reducer-01, reducer-10, reducer-11, and shuffle data as follows:
      ```
      A00 A01 A02 ... A0,99
      B00 B10 B20 ... B99,0    shuffled into reducer-00
      
      A00 A01 A02 ... A0,99
      B01 B11 B21 ... B99,1    shuffled into reducer-01
      
      A10 A11 A12 ... A1,99
      B00 B10 B20 ... B99,0    shuffled into reducer-10
      
      A10 A11 A12 ... A1,99
      B01 B11 B21 ... B99,1    shuffled into reducer-11
      ```
      
      The shuffling above is a `cogroup` transform; note that each reducer contains **only one group**.
      
      **step-2**
      Step-2 does an `aggregateByKey` transform on the result of step-1, also generating 4 reducers, and produces the final result RDD, which contains 4 partitions, each holding one block.
      
      The main problems are in step-1. Now we have only 4 reducers, but matrices A and B have 400 blocks in total; obviously the reducer number is too small.
      Also, we can see that each reducer contains only one group (the group concept in the `cogroup` transform), and each group contains 200 blocks. This is terrible because we know that the `cogroup` transform will load each group into memory when computing. It does not scale at the algorithm level. Suppose matrix A has 10000 column blocks or more instead of 100? Then each reducer will load 20000 blocks into memory. It will easily cause reducer OOM.
      
      This PR tries to resolve the problem described above.
      When matrix A with dimension M * N multiplies matrix B with dimension N * P, the middle dimension N is the key point. If N is large, the current multiplication implementation works badly.
      In this PR, I introduce a `numMidDimSplits` parameter, representing how many splits to cut on the middle dimension N.
      Still using the example described above, now we set `numMidDimSplits = 10`, and we can generate 40 reducers in **step-1**:
      
      Each reducer-ij above will now be split into 10 reducers: reducer-ij0, reducer-ij1, ... reducer-ij9, and each reducer will receive 20 blocks.
      Now the shuffle works as follows:
      
      **reducer-000 to reducer-009**
      ```
      A0,0 A0,10 A0,20 ... A0,90
      B0,0 B10,0 B20,0 ... B90,0    shuffled into reducer-000
      
      A0,1 A0,11 A0,21 ... A0,91
      B1,0 B11,0 B21,0 ... B91,0    shuffled into reducer-001
      
      A0,2 A0,12 A0,22 ... A0,92
      B2,0 B12,0 B22,0 ... B92,0    shuffled into reducer-002
      
      ...
      
      A0,9 A0,19 A0,29 ... A0,99
      B9,0 B19,0 B29,0 ... B99,0    shuffled into reducer-009
      ```
      
      **reducer-010 to reducer-019**
      ```
      A0,0 A0,10 A0,20 ... A0,90
      B0,1 B10,1 B20,1 ... B90,1    shuffled into reducer-010
      
      A0,1 A0,11 A0,21 ... A0,91
      B1,1 B11,1 B21,1 ... B91,1    shuffled into reducer-011
      
      A0,2 A0,12 A0,22 ... A0,92
      B2,1 B12,1 B22,1 ... B92,1    shuffled into reducer-012
      
      ...
      
      A0,9 A0,19 A0,29 ... A0,99
      B9,1 B19,1 B29,1 ... B99,1    shuffled into reducer-019
      ```
      
      **reducer-100 to reducer-109** and **reducer-110 to reducer-119** are similar to the above, so I omit them.
      
      ### API for this optimized algorithm
      
      I add a new API as follows:
      ```
        def multiply(
            other: BlockMatrix,
            numMidDimSplits: Int // middle dimension split number, explained above
      ): BlockMatrix
      ```
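
      A small usage sketch of this overload on tiny matrices (assuming a build that includes the new API):
      ```scala
      import org.apache.spark.mllib.linalg.Matrices
      import org.apache.spark.mllib.linalg.distributed.BlockMatrix
      import org.apache.spark.sql.SparkSession

      object BlockMatrixMultiplySketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("block-matrix").getOrCreate()
          val sc = spark.sparkContext

          // Two 4x4 matrices stored as a 2x2 grid of 2x2 blocks.
          def block(v: Double) = Matrices.dense(2, 2, Array.fill(4)(v))
          val blocksA = sc.parallelize(for (i <- 0 to 1; j <- 0 to 1) yield ((i, j), block(1.0)))
          val blocksB = sc.parallelize(for (i <- 0 to 1; j <- 0 to 1) yield ((i, j), block(2.0)))
          val matA = new BlockMatrix(blocksA, 2, 2)
          val matB = new BlockMatrix(blocksB, 2, 2)

          // numMidDimSplits = 2 spreads the shared (middle) dimension over more reducers.
          val product = matA.multiply(matB, 2)
          println(product.toLocalMatrix())

          spark.stop()
        }
      }
      ```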
      
      ### Shuffled data size analysis (compared under the same parallelism)
      
      The optimization has some subtle influence on the total shuffled data size. An appropriate `numMidDimSplits` will significantly reduce the shuffled data size,
      but a `numMidDimSplits` that is too large may instead increase the shuffled data. For now I don't want to introduce a formula and make things too complex; I only use a simple case to illustrate it here:
      
      Suppose we have two same-size square matrices X and Y, both with `16 numRowBlocks * 16 numColBlocks`. X and Y are both dense matrices. Now let me analyze the shuffled data size in the following cases:
      
      **case 1: X and Y both partitioned in 16 rowPartitions and 16 colPartitions, numMidDimSplits = 1**
      ShufflingDataSize = (16 * 16 * (16 + 16) + 16 * 16) blocks = 8448 blocks
      parallelism = 16 * 16 * 1 = 256 // use the step-1 reducer count as the parallelism because it accounts for most of the computation time in this algorithm.
      
      **case 2: X and Y both partitioned in 8 rowPartitions and 8 colPartitions, numMidDimSplits = 4**
      ShufflingDataSize = (8 * 8 * (32 + 32) + 16 * 16 * 4) blocks = 5120 blocks
      parallelism = 8 * 8 * 4 = 256 // use the step-1 reducer count as the parallelism because it accounts for most of the computation time in this algorithm.
      
      **The two cases above both have parallelism = 256.** Case 1 (`numMidDimSplits = 1`) is equivalent to the current implementation in mllib, but the case 2 shuffled data is 60.6% of case 1. **It shows that under the same parallelism, a proper `numMidDimSplits` will significantly reduce the shuffled data size.**
      
      ## How was this patch tested?
      
      Test suites added.
      Running result:
      ![blockmatrix](https://cloud.githubusercontent.com/assets/19235986/21600989/5e162cc2-d1bf-11e6-868c-0ec29190b605.png)
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15730 from WeichenXu123/optim_block_matrix.
      1191fe26
  16. Jan 25, 2017
    • sethah's avatar
      [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features · 0e821ec6
      sethah authored
      ## What changes were proposed in this pull request?
      
      The following test will fail on current master
      
      ````scala
      test("gmm fails on high dimensional data") {
          val ctx = spark.sqlContext
          import ctx.implicits._
          val df = Seq(
            Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
            Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
            .map(Tuple1.apply).toDF("features")
          val gm = new GaussianMixture()
          intercept[IllegalArgumentException] {
            gm.fit(df)
          }
        }
      ````
      
      Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for high number of features, we should perform an appropriate check to communicate this limitation to users.
      
      This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. For the feature limit, we can cap it at something like `math.sqrt(Integer.MaxValue).toInt` (about 46k) so that we do not get numerical overflow, which eliminates the cryptic error. However, in WLS for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit to 4096 features there. We may want to keep that convention here for consistency.
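
      A self-contained sketch of the kind of guard described above (the exact constant and message Spark adopts may differ):
      ```scala
      object GmmFeatureLimitSketch {
        def main(args: Array[String]): Unit = {
          // ~46340: keeps numFeatures * numFeatures below Int.MaxValue.
          val MaxNumFeatures = math.sqrt(Int.MaxValue).toInt

          def validateNumFeatures(numFeatures: Int): Unit =
            require(numFeatures < MaxNumFeatures,
              s"GaussianMixture cannot handle more than $MaxNumFeatures features, but got $numFeatures.")

          validateNumFeatures(100)                    // fine
          // validateNumFeatures(MaxNumFeatures + 1)  // would throw IllegalArgumentException
          println("validation passed")
        }
      }
      ```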
      
      ## How was this patch tested?
      Unit tests in ML and MLlib.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #16661 from sethah/gmm_high_dim.
      0e821ec6
  17. Jan 24, 2017
    • Ilya Matiach's avatar
      [SPARK-18036][ML][MLLIB] Fixing decision trees handling edge cases · d9783380
      Ilya Matiach authored
      ## What changes were proposed in this pull request?
      
      Decision trees/GBT/RF do not handle edge cases such as constant features or empty features.
      In the case of constant features we choose any arbitrary split instead of failing with a cryptic error message.
      In the case of empty features we fail with a better error message stating:
      DecisionTree requires number of features > 0, but was given an empty features vector
      Instead of the cryptic error message:
      java.lang.UnsupportedOperationException: empty.max
      
      ## How was this patch tested?
      
      Unit tests are added in the patch for:
      DecisionTreeRegressor
      GBTRegressor
      Random Forest Regressor
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      
      Closes #16377 from imatiach-msft/ilmat/fix-decision-tree.
      d9783380
    • Souljoy Zhuo's avatar
      delete useless var “j” · cca86800
      Souljoy Zhuo authored
      The var `j` defined in `var j = 0` is unused in `def compress`.
      
      Author: Souljoy Zhuo <zhuoshoujie@126.com>
      
      Closes #16676 from xiaoyesoso/patch-1.
      cca86800
  18. Jan 23, 2017
    • Zheng RuiFeng's avatar
      [SPARK-17747][ML] WeightCol support non-double numeric datatypes · 49f5b0ae
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      1. Add a test for `WeightCol` in `MLTestingUtils.checkNumericTypes`.
      2. Move the datatype cast to `Predict.fit`, and supply the algorithms' `train()` with the cast DataFrame.
      ## How was this patch tested?
      
      local tests in spark-shell and unit tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15314 from zhengruifeng/weightCol_support_int.
      49f5b0ae
    • Ilya Matiach's avatar
      [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case · 5b258b8b
      Ilya Matiach authored
      [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments
      
      ## What changes were proposed in this pull request?
      
      Fix a bug in which BisectingKMeans fails with error:
      java.util.NoSuchElementException: key not found: 166
              at scala.collection.MapLike$class.default(MapLike.scala:228)
              at scala.collection.AbstractMap.default(Map.scala:58)
              at scala.collection.MapLike$class.apply(MapLike.scala:141)
              at scala.collection.AbstractMap.apply(Map.scala:58)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
              at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
              at scala.collection.immutable.List.foldLeft(List.scala:84)
              at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
              at scala.collection.immutable.List.reduceLeft(List.scala:84)
              at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
              at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
              at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
              at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
      
      ## How was this patch tested?
      
      The dataset was run against the code change to verify that the code works.  I will try to add unit tests to the code.
      
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      
      Closes #16355 from imatiach-msft/ilmat/fix-kmeans.
      5b258b8b
    • z001qdp's avatar
      [SPARK-17455][MLLIB] Improve PAVA implementation in IsotonicRegression · c8aea744
      z001qdp authored
      ## What changes were proposed in this pull request?
      
      New implementation of the Pool Adjacent Violators Algorithm (PAVA) in mllib.IsotonicRegression, which is used under the hood by ml.regression.IsotonicRegression. The previous implementation could have factorial complexity in the worst case. This implementation, which closely follows those in scikit-learn and the R `iso` package, runs in quadratic time in the worst case.
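
      For intuition, a compact self-contained PAVA sketch for weighted isotonic regression (illustrative only, not Spark's implementation): it scans left to right and merges adjacent blocks whose means violate the non-decreasing constraint.
      ```scala
      object PavaSketch {
        /** y(i), w(i): response and weight at the i-th position (inputs already sorted by x). */
        def pava(y: Array[Double], w: Array[Double]): Array[Double] = {
          case class Block(var sum: Double, var weight: Double, var size: Int) {
            def mean: Double = sum / weight
          }
          val blocks = scala.collection.mutable.ArrayBuffer.empty[Block]
          for (i <- y.indices) {
            blocks += Block(y(i) * w(i), w(i), 1)
            // Merge backwards while the non-decreasing constraint is violated.
            while (blocks.length > 1 && blocks(blocks.length - 2).mean > blocks.last.mean) {
              val last = blocks.remove(blocks.length - 1)
              blocks.last.sum += last.sum
              blocks.last.weight += last.weight
              blocks.last.size += last.size
            }
          }
          blocks.flatMap(b => Array.fill(b.size)(b.mean)).toArray
        }

        def main(args: Array[String]): Unit = {
          // Expected fitted values: 2.0, 2.0, 2.0, 4.0
          println(pava(Array(3.0, 1.0, 2.0, 4.0), Array(1.0, 1.0, 1.0, 1.0)).mkString(", "))
        }
      }
      ```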
      ## How was this patch tested?
      
      Existing unit tests in both `mllib` and `ml` passed before and after this patch. Scaling properties were tested by running the `poolAdjacentViolators` method in [scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template) with the input generated by
      
      ``` scala
      val x = (1 to length).toArray.map(_.toDouble)
      val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi}
      val w = Array.fill(length)(1d)
      
      val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, x), w) => (y, x, w)}
      ```
      
      Before this patch:
      
      | Input Length | Time (us) |
      | --: | --: |
      | 100 | 1.35 |
      | 200 | 3.14 |
      | 400 | 116.10 |
      | 800 | 2134225.90 |
      
      After this patch:
      
      | Input Length | Time (us) |
      | --: | --: |
      | 100 | 1.25 |
      | 200 | 2.53 |
      | 400 | 5.86 |
      | 800 | 10.55 |
      
      Benchmarking was also performed with randomly-generated y values, with similar results.
      
      Author: z001qdp <Nicholas.Eggert@target.com>
      
      Closes #15018 from neggert/SPARK-17455-isoreg-algo.
      c8aea744
    • Yuhao's avatar
      [SPARK-14709][ML] spark.ml API for linear SVM · 4a11d029
      Yuhao authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-14709
      
      Provide API for SVM algorithm for DataFrames. As discussed in jira, the initial implementation uses OWL-QN with Hinge loss function.
      The API should mimic existing spark.ml.classification APIs.
      Currently only Binary Classification is supported. Multinomial support can be added in this or following release.
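
      A minimal usage sketch of the resulting estimator (toy data; the parameter values are illustrative, not recommendations):
      ```scala
      import org.apache.spark.ml.classification.LinearSVC
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.sql.SparkSession

      object LinearSVCSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("linear-svc").getOrCreate()
          import spark.implicits._

          val training = Seq(
            (0.0, Vectors.dense(0.0, 1.1)),
            (1.0, Vectors.dense(2.0, 1.0)),
            (0.0, Vectors.dense(0.1, 1.2)),
            (1.0, Vectors.dense(2.2, 0.8))
          ).toDF("label", "features")

          val lsvc = new LinearSVC().setMaxIter(10).setRegParam(0.1)
          val model = lsvc.fit(training)
          println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")

          spark.stop()
        }
      }
      ```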
      ## How was this patch tested?
      
      new unit tests and simple manual test
      
      Author: Yuhao <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #15211 from hhbyyh/mlsvm.
      4a11d029
    • actuaryzhang's avatar
      [SPARK-19155][ML] Make family case insensitive in GLM · f067acef
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily`
      ```
      model.getFamily == Binomial.name || model.getFamily == Poisson.name
      ```
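
      A hedged sketch of a case-insensitive variant of that check (illustrative; the actual change may normalize the value elsewhere instead):
      ```scala
      object FamilyCheckSketch extends App {
        // Compare the family name ignoring case instead of by exact string equality.
        def isBinomialOrPoisson(family: String): Boolean =
          family.equalsIgnoreCase("binomial") || family.equalsIgnoreCase("poisson")

        assert(isBinomialOrPoisson("Poisson") && isBinomialOrPoisson("BINOMIAL"))
      }
      ```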
      
      ## How was this patch tested?
      Update existing tests for 'Poisson' and 'Binomial'.
      
      yanboliang felixcheung imatiach-msft
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16675 from actuaryzhang/family.
      f067acef
  19. Jan 21, 2017
  20. Jan 19, 2017
    • Zheng RuiFeng's avatar
      [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary · 8ccca917
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      add loglikelihood in GMM.summary
      
      ## How was this patch tested?
      
      added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #12064 from zhengruifeng/gmm_metric.
      8ccca917
  21. Jan 18, 2017
    • Ilya Matiach's avatar
      [SPARK-14975][ML] Fixed GBTClassifier to predict probability per training... · fe409f31
      Ilya Matiach authored
      [SPARK-14975][ML] Fixed GBTClassifier to predict probability per training instance and fixed interfaces
      
      ## What changes were proposed in this pull request?
      
      For all of the classifiers in MLLib we can predict probabilities except for GBTClassifier.
      Also, all classifiers inherit from ProbabilisticClassifier but GBTClassifier strangely inherits from Predictor, which is a bug.
      This change corrects the interface and adds the ability for the classifier to give a probabilities vector.
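
      A short usage sketch under the assumption that the running Spark build includes this change, so the transformed output carries the usual `probability` column (toy data):
      ```scala
      import org.apache.spark.ml.classification.GBTClassifier
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.sql.SparkSession

      object GbtProbabilitySketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("gbt-probability").getOrCreate()
          import spark.implicits._

          val train = Seq(
            (0.0, Vectors.dense(0.0, 1.0)),
            (1.0, Vectors.dense(1.0, 0.0)),
            (0.0, Vectors.dense(0.1, 0.9)),
            (1.0, Vectors.dense(0.9, 0.1))
          ).toDF("label", "features")

          val model = new GBTClassifier().setMaxIter(5).fit(train)
          // With this change, each row gets a per-class probability vector alongside the prediction.
          model.transform(train).select("prediction", "probability").show(false)

          spark.stop()
        }
      }
      ```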
      
      ## How was this patch tested?
      
      The basic ML tests were run after making the changes.  I've marked this as WIP as I need to add more tests.
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      
      Closes #16441 from imatiach-msft/ilmat/fix-GBT.
      fe409f31
    • uncleGen's avatar
      [SPARK-19227][SPARK-19251] remove unused imports and outdated comments · eefdf9f9
      uncleGen authored
      ## What changes were proposed in this pull request?
      Remove unused imports and outdated comments, and fix some minor code style issues.
      
      ## How was this patch tested?
      existing ut
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16591 from uncleGen/SPARK-19227.
      eefdf9f9
  22. Jan 17, 2017
    • Zheng RuiFeng's avatar
      [SPARK-18206][ML] Add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR · e7f982b2
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR
      ## How was this patch tested?
      
      local test in spark-shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #15671 from zhengruifeng/lir_instr.
      e7f982b2
    • hyukjinkwon's avatar
      [SPARK-3249][DOC] Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc` · 6c00c069
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix ambiguous link warnings by simply making them as code blocks for both javadoc and scaladoc.
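
      As a rough before/after illustration of that change (not an actual Spark source file; the warnings in question are shown below):
      ```scala
      // Illustrative only.
      object DocLinkStyleExample {
        /** Before: [[org.apache.spark.SparkContext#accumulator]] is ambiguous (several overloads match). */
        def before(): Unit = ()

        /** After: refer to the member as a code literal instead: `SparkContext.accumulator`. */
        def after(): Unit = ()
      }
      ```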
      
      ```
      [warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The link target "SparkContext#accumulator" is ambiguous. Several members fit the target:
      [warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281: The link target "runMiniBatchSGD" is ambiguous. Several members fit the target:
      [warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83: The link target "run" is ambiguous. Several members fit the target:
      ...
      ```
      
      This PR also fixes javadoc8 break as below:
      
      ```
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
      [error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
      [error]                                                   ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
      [error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
      [error]                                                                                ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
      [error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
      [error]                                                                                                ^
      [info] 3 errors
      ```
      
      ## How was this patch tested?
      
      Manually via `sbt unidoc > output.txt` and then checked it via `cat output.txt | grep ambiguous`
      
      and `sbt unidoc | grep error`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16604 from HyukjinKwon/SPARK-3249.
      6c00c069
  23. Jan 16, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19066][SPARKR] SparkR LDA doesn't set optimizer correctly · 12c8c216
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, when it should be TRUE based on the Scala code.
      
      In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
      
      ## How was this patch tested?
      Manual tests by comparing with the Scala example.
      Modified the current unit tests: fixed the incorrect unit test and added necessary tests for the `summary` method.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16464 from wangmiao1981/new.
      12c8c216
  24. Jan 13, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19142][SPARKR] spark.kmeans should take seed, initSteps, and tol as parameters · 7f24a0b6
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      spark.kmeans doesn't have an interface to set initSteps, seed, and tol. As Spark's KMeans algorithm doesn't take the same set of parameters as R's kmeans, we should maintain a different interface in spark.kmeans.
      
      Add missing parameters and corresponding document.
      
      Modified existing unit tests to take additional parameters.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16523 from wangmiao1981/kmeans.
      7f24a0b6
  25. Jan 12, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19110][MLLIB][FOLLOWUP] Add a unit test for testing logPrior and... · c983267b
      wm624@hotmail.com authored
      [SPARK-19110][MLLIB][FOLLOWUP] Add a unit test for testing logPrior and logLikelihood of DistributedLDAModel in MLLIB
      
      ## What changes were proposed in this pull request?
      #16491 added the fix to mllib and a unit test to ml. This follow-up PR adds unit tests to the mllib suite.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16524 from wangmiao1981/ldabug.
      c983267b
  26. Jan 10, 2017
    • Peng, Meng's avatar
      [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change · 32286ba6
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      Add FDR test case in ml/feature/ChiSqSelectorSuite.
      Improve some comments in the code.
      This is a follow-up pr for #15212.
      
      ## How was this patch tested?
      ut
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #16434 from mpjlu/fdr_fwe_update.
      32286ba6