  1. Oct 20, 2016
    • [SPARK-18021][SQL] Refactor file name specification for data sources · 7f9ec19e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Currently each data source OutputWriter is responsible for specifying the entire file name for each file output. This, however, does not make any sense because we rely on file naming schemes for certain behaviors in Spark SQL, e.g. bucket id. The current approach allows individual data sources to break the implementation of bucketing.
      
      On the flip side, we also don't want to move file naming entirely out of data sources, because different data sources do want to specify different extensions.
      
      This patch divides file name specification into two parts: the first part is a prefix specified by the caller of OutputWriter (in WriteOutput), and the second part is the suffix that can be specified by the OutputWriter itself. Note that a side effect of this change is that now all file based data sources also support bucketing automatically.
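
      For illustration, a minimal sketch of the two-part naming scheme (the names below are illustrative, not Spark's actual API):

      ```scala
      // Hypothetical sketch: the caller owns the prefix, which encodes
      // Spark-managed parts such as the task id and the optional bucket id;
      // the data source contributes only the extension.
      trait NamedOutputWriter {
        def fileExtension: String // e.g. ".snappy.parquet", ".csv"
      }

      object FileNaming {
        def outputFileName(taskId: Int, bucketId: Option[Int], writer: NamedOutputWriter): String = {
          val bucket = bucketId.map(id => f"_$id%05d").getOrElse("")
          f"part-$taskId%05d$bucket${writer.fileExtension}"
        }
      }
      ```

      Because the prefix (including any bucket id) is fixed by the caller, a data source cannot accidentally break bucketing, which is why bucketing now works for all file-based sources.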
      
      There are also some other minor cleanups:
      
      - Removed the UUID passed through generic Configuration string
      - Some minor rewrites for better clarity
      - Renamed "path" in multiple places to "stagingDir", to more accurately reflect its meaning
      
      ## How was this patch tested?
      This should be covered by existing data source tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15562 from rxin/SPARK-18021.
  2. Oct 14, 2016
    • [SPARK-17941][ML][TEST] Logistic regression tests should use sample weights. · de1c1ca5
      sethah authored
      ## What changes were proposed in this pull request?
      
      The sample weight testing for logistic regression is not robust. The logistic regression suite already has many test cases comparing results to R's glmnet. Since both libraries support sample weights, we should use sample weights in these tests to increase coverage of sample weighting. This patch adds essentially no new code; it just makes the testing more complete.
      
      Also fixed some errors in the R code referenced in the test suite. Changed `standardization=T` to `standardize=T`, since the former is invalid.
      
      ## How was this patch tested?
      
      Existing unit tests are modified. No non-test code is touched.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15488 from sethah/logreg_weight_tests.
    • [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference · c8b612de
      Peng authored
      
      ## What changes were proposed in this pull request?
      
      For the feature selection method ChiSquareSelector, selection is based on ChiSquareTestResult.statistic (the chi-square value): the features with the largest chi-square values are selected. But the degrees of freedom (df) of the chi-square values returned by Statistics.chiSqTest(RDD) differ across features, and chi-square statistics with different df are not directly comparable, so features cannot be ranked by chi-square value alone.
      
      So we change the selection criterion from statistic to pValue for SelectKBest and SelectPercentile.
      
      ## How was this patch tested?
      Changed existing tests.
      
      Author: Peng <peng.meng@intel.com>
      
      Closes #15444 from mpjlu/chisqure-bug.
    • [SPARK-14634][ML] Add BisectingKMeansSummary · a1b136d0
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add BisectingKMeansSummary
      
      ## How was this patch tested?
      unit test
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12394 from zhengruifeng/biKMSummary.
  3. Oct 12, 2016
    • [SPARK-17835][ML][MLLIB] Optimize NaiveBayes mllib wrapper to eliminate extra pass on data · 21cb59f1
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, there are some differences in how mllib and ml handle labels:
      * mllib allows input labels such as {-1, +1}, whereas ml assumes the input labels are in the range [0, numClasses).
      * the mllib ```NaiveBayesModel``` exposes ```labels```, but ml does not, due to the assumption mentioned above.
      
      During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used
      ```val labels = data.map(_.label).distinct().collect().sorted```
      to get the distinct labels first and then encoded them for training. This involves an extra Spark job compared with the original implementation. Since ```NaiveBayes``` needs only one aggregation pass over the data during training, adding another pass is inefficient. We can collect the labels in a single pass alongside ```NaiveBayes``` training and send them to the MLlib side.
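
      A rough sketch of the idea (the per-label statistics here are made up; the real NaiveBayes aggregation is more involved):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // Rough sketch: per-label sufficient statistics and the distinct labels
      // both come out of one aggregateByKey pass, so no separate distinct() job.
      object SinglePassLabels {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("nb-sketch"))
          val data = sc.parallelize(Seq((0.0, 1.0), (1.0, 2.0), (0.0, 3.0))) // (label, feature)
          val aggregated = data.aggregateByKey((0L, 0.0))(
            (acc, v) => (acc._1 + 1, acc._2 + v), // count and sum per label
            (a, b) => (a._1 + b._1, a._2 + b._2)
          ).collect()
          val labels = aggregated.map(_._1).sorted // labels fall out of the same pass
          println(labels.mkString(", "))
          sc.stop()
        }
      }
      ```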
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15402 from yanboliang/spark-17835.
    • [SPARK-11560][MLLIB] Optimize KMeans implementation / remove 'runs' · 8d33e1e5
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      This is a revival of https://github.com/apache/spark/pull/14948 and related to https://github.com/apache/spark/pull/14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it.
      
      This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return fewer than k centroids if not enough data is available.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15342 from srowen/SPARK-11560.
  4. Oct 11, 2016
    • [SPARK-15153][ML][SPARKR] Fix SparkR spark.naiveBayes error when label is numeric type · 23405f32
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Fix the SparkR ```spark.naiveBayes``` error that occurs when the response variable of the dataset is of numeric type.
      See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
      
      ## How was this patch tested?
      Add unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15431 from yanboliang/spark-15153-2.
    • [SPARK-15957][ML] RFormula supports forcing to index label · 19401a20
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Currently ```RFormula``` indexes the label only when it is of string type. If the label is numeric and we use ```RFormula``` for a classification model, there are no label attributes in the label column metadata. The label attributes are useful when making predictions for classification, so for classification we can force label indexing with ```StringIndexer``` whether the label is numeric or string. Then SparkR wrappers can extract label attributes from the label column metadata successfully. This feature can help us fix bugs similar to [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
      For regression, we still keep the label as numeric type.
      In this PR, we add a param ```indexLabel``` to control whether to force label indexing in ```RFormula```.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13675 from yanboliang/spark-15957.
  5. Oct 10, 2016
    • [SPARK-14610][ML] Remove superfluous split for continuous features in decision tree training · 03c40202
      sethah authored
      ## What changes were proposed in this pull request?
      
      A nonsensical split is produced from method `findSplitsForContinuousFeature` for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is `> 0` is removed, and instead features with zero possible splits are ignored.
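
      The gist, as a simplified sketch (real split finding also bins values and caps the number of splits):

      ```scala
      // Simplified sketch: candidate splits for a continuous feature are the
      // midpoints between adjacent distinct values. A constant feature yields
      // an empty array, which is now skipped instead of tripping an assertion.
      def candidateSplits(values: Array[Double]): Array[Double] = {
        val distinct = values.distinct.sorted
        distinct.sliding(2).collect { case Array(a, b) => (a + b) / 2.0 }.toArray
      }

      // candidateSplits(Array(1.0, 1.0, 1.0)) == Array()          // feature ignored
      // candidateSplits(Array(1.0, 2.0, 4.0)) == Array(1.5, 3.0)
      ```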
      
      ## How was this patch tested?
      
      A unit test was added to check that finding splits for a constant feature produces an empty array.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12374 from sethah/SPARK-14610.
  6. Oct 07, 2016
    • [MINOR][ML] remove redundant comment in LogisticRegression · 471690f9
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      While adding the R wrapper for LogisticRegression, I found one redundant comment. It is minor, so I just removed it.
      
      ## How was this patch tested?
      Unit tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15391 from wangmiao1981/mlordoc.
    • [SPARK-17761][SQL] Remove MutableRow · 97594c29
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      In practice we cannot guarantee that an `InternalRow` is immutable. This makes the `MutableRow` almost redundant. This PR folds `MutableRow` into `InternalRow`.
      
      The code below illustrates the immutability issue with InternalRow:
      ```scala
      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

      val struct = new GenericMutableRow(1)
      val row = InternalRow(struct, 1)
      println(row)         // prints: [[null], 1]
      struct.setInt(0, 42)
      println(row)         // prints: [[42], 1]
      ```
      
      This might be somewhat controversial, so feedback is appreciated.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15333 from hvanhovell/SPARK-17761.
  7. Oct 06, 2016
    • [SPARK-17792][ML] L-BFGS solver for linear regression does not accept general numeric label column types · 3713bb19
      sethah authored
      
      ## What changes were proposed in this pull request?
      
      Before, we computed `instances` in LinearRegression in two spots, even though they did the same thing. One of them did not cast the label column to `DoubleType`. This patch consolidates the computation and always casts the label column to `DoubleType`.
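
      A sketch of the consolidated pattern (`Instance` here is an illustrative stand-in for Spark's internal class):

      ```scala
      import org.apache.spark.ml.linalg.Vector
      import org.apache.spark.sql.{DataFrame, Row}
      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.types.DoubleType

      // Illustrative stand-in for the internal Instance class.
      case class Instance(label: Double, weight: Double, features: Vector)

      // Casting the label column to DoubleType up front lets any numeric
      // label type (Int, Float, Decimal, ...) work with every solver.
      def extractInstances(dataset: DataFrame, labelCol: String, featuresCol: String) =
        dataset.select(col(labelCol).cast(DoubleType), col(featuresCol)).rdd.map {
          case Row(label: Double, features: Vector) => Instance(label, 1.0, features)
        }
      ```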
      
      ## How was this patch tested?
      
      Added a unit test to check all solvers. This test failed before this patch.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15364 from sethah/linreg_numeric_type.
    • [MINOR][ML] Avoid 2D array flatten in NB training. · 7aeb20be
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Avoid flattening a 2D array in ```NaiveBayes``` training, since the flatten method can be expensive (it creates another array and copies the data into it).
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15359 from yanboliang/nb-theta.
  8. Oct 04, 2016
    • [SPARK-17744][ML] Parity check between the ml and mllib test suites for NB · c17f9718
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Parity check and add missing test suites for ml's NB.
      2. Remove some unused imports.
      
      ## How was this patch tested?
       manual tests in spark-shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15312 from zhengruifeng/nb_test_parity.
    • [SPARK-17559][MLLIB] persist edges if their storage level is none in PeriodicGraphCheckpointer · 126baa8d
      ding authored
      ## What changes were proposed in this pull request?
      When using PeriodicGraphCheckpointer to persist a graph, the edges are sometimes not persisted, because currently the graph is persisted only when the vertices' storage level is none. However, the vertices' storage level can be non-none while the edges' storage level is none, e.g. for a graph created by an outerJoinVertices operation, where the vertices are automatically cached but the edges are not. In that case the edges will not be persisted by PeriodicGraphCheckpointer. We need to check the edges' storage level separately and persist them if it is none.
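
      A minimal sketch of the fix:

      ```scala
      import org.apache.spark.graphx.Graph
      import org.apache.spark.storage.StorageLevel

      // Check the vertices' and edges' storage levels independently instead of
      // gating both persists on the vertices alone.
      def persistGraph[VD, ED](g: Graph[VD, ED]): Unit = {
        if (g.vertices.getStorageLevel == StorageLevel.NONE) g.vertices.persist()
        if (g.edges.getStorageLevel == StorageLevel.NONE) g.edges.persist()
      }
      ```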
      
      ## How was this patch tested?
       manual tests
      
      Author: ding <ding@localhost.localdomain>
      
      Closes #15124 from dding3/spark-persisitEdge.
  9. Oct 01, 2016
  10. Sep 30, 2016
    • [SPARK-14077][ML][FOLLOW-UP] Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0 · 8e491af5
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
      Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0
      
      ## How was this patch tested?
      local build
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15313 from zhengruifeng/revert_save_load.
    • [SPARK-14077][ML] Refactor NaiveBayes to support weighted instances · 1fad5596
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Support weighted data.
      2. Use DataFrame/Dataset instead of RDD.
      3. Make mllib a wrapper that calls ml.
      
      ## How was this patch tested?
      local manual tests in spark-shell
      unit tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12819 from zhengruifeng/weighted_nb.
  11. Sep 29, 2016
    • [SPARK-17697][ML] Fixed bug in summary calculations that pattern match against label without casting · 2f739567
      Bryan Cutler authored
      
      ## What changes were proposed in this pull request?
      When calling LogisticRegression.evaluate or GeneralizedLinearRegression.evaluate on a Dataset whose label column is not of double type, the summary calculations pattern match against Double and throw a MatchError. This fix casts the label column to DoubleType to ensure there is no MatchError.
      
      ## How was this patch tested?
      Added unit tests that call evaluate with datasets whose label column has other numeric types.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.
    • [SPARK-17721][MLLIB][ML] Fix for multiplying transposed SparseMatrix with SparseVector · 29396e7d
      Bjarne Fruergaard authored
      ## What changes were proposed in this pull request?
      
      * changes the implementation of gemv with a transposed SparseMatrix and a SparseVector, in both mllib-local and mllib (identical changes)
      * adds a test that failed before this change and succeeds with it.
      
      The problem in the previous implementation was that `i`, the index enumerating the entries of a row of the SparseMatrix, was only incremented when the row index of a vector entry matched the column index of the SparseMatrix. In cases where a row of the SparseMatrix has non-zero values at column indices lower than the corresponding non-zero row indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column index at position `i`, and the remaining column indices i+1,...,indEnd-1 are never visited. The test cases in this PR illustrate this issue.
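
      A self-contained sketch of the corrected two-pointer merge over sorted index arrays:

      ```scala
      // The matrix-row pointer i must also advance when its column index lags
      // behind the vector's row index, not only on exact matches.
      def sparseDot(aIdx: Array[Int], aVal: Array[Double],
                    bIdx: Array[Int], bVal: Array[Double]): Double = {
        var i = 0; var j = 0; var sum = 0.0
        while (i < aIdx.length && j < bIdx.length) {
          if (aIdx(i) == bIdx(j)) { sum += aVal(i) * bVal(j); i += 1; j += 1 }
          else if (aIdx(i) < bIdx(j)) i += 1 // the advance that was effectively missing
          else j += 1
        }
        sum
      }
      ```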
      
      ## How was this patch tested?
      
      I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.
      
      ---
      As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.
      
      Mentioning dbtsai, viirya and brkyvz whom I can see have worked/authored on these parts before.
      
      Author: Bjarne Fruergaard <bwahlgreen@gmail.com>
      
      Closes #15296 from bwahlgreen/bugfix-spark-17721.
    • [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement. · f7082ac1
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Several performance improvements for ```ChiSqSelector```:
      1. Keep ```selectedFeatures``` in ascending order.
      ```ChiSqSelectorModel.transform``` needs ```selectedFeatures``` to be ordered to make predictions. We should sort it when training the model rather than when making predictions, since users usually train a model once and use it for prediction many times (see the sketch after this list).
      2. When training an ```fpr```-type ```ChiSqSelectorModel```, it is not necessary to sort the ChiSq test results by statistic.
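
      A sketch of the ordering fix (`TestResult` is a hypothetical stand-in for the per-feature chi-square test result):

      ```scala
      // Hypothetical stand-in for the per-feature chi-square test result.
      case class TestResult(statistic: Double, pValue: Double)

      val chiSqResults = Array(TestResult(3.2, 0.07), TestResult(9.1, 0.002), TestResult(0.4, 0.5))
      val k = 2
      val selectedFeatures = chiSqResults.zipWithIndex
        .sortBy { case (r, _) => -r.statistic } // rank features by test statistic
        .take(k)
        .map { case (_, idx) => idx }
        .sorted                                 // ascending, done once at training time
      // selectedFeatures == Array(0, 1)
      ```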
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15277 from yanboliang/spark-17704.
    • [SPARK-16356][FOLLOW-UP][ML] Enforce ML test of exception for local/distributed Dataset. · a19a1bb5
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #14035 added ```testImplicits``` to ML unit tests and promoted ```toDF()```, but left one minor issue in ```VectorIndexerSuite```. If we create the DataFrame with ```Seq(...).toDF()```, one of the test cases throws a different error/exception than with ```sc.parallelize(Seq(...)).toDF()```.
      After an in-depth study, I found it was caused by the different behavior of local and distributed Datasets when a UDF fails an ```assert```. A local Dataset throws the ```AssertionError``` directly, while a distributed Dataset throws a ```SparkException``` that wraps the ```AssertionError```. I think we should enforce this test to cover both cases.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15261 from yanboliang/spark-16356.
  12. Sep 27, 2016
    • [SPARK-17666] Ensure that RecordReaders are closed by data source file scans · b03b4adf
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully consume their input may leak file handles / network connections (e.g. S3 connections). Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans close record readers only once their iterators are fully consumed.
      
      This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.
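
      A sketch of the callback pattern the patch relies on (the helper is an assumption; only `TaskContext.addTaskCompletionListener` is the real API):

      ```scala
      import org.apache.spark.TaskContext

      // Register cleanup when the reader is opened, so it is closed even if
      // the task never exhausts the iterator.
      def withTaskCleanup(reader: java.io.Closeable): java.io.Closeable = {
        Option(TaskContext.get()).foreach { ctx =>
          ctx.addTaskCompletionListener((_: TaskContext) => reader.close())
        }
        reader
      }
      ```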
      
      ## How was this patch tested?
      
      Tested manually for now.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15245 from JoshRosen/SPARK-17666-close-recordreader.
    • [SPARK-15962][SQL] Introduce implementation with a dense format for UnsafeArrayData · 85b0a157
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR introduces more compact representation for ```UnsafeArrayData```.
      
      ```UnsafeArrayData``` needs to accept a ```null``` value in each entry of an array. The current version has three parts:
      ```
      [numElements] [offsets] [values]
      ```
      The `offsets` region contains `numElements` entries; a negative entry represents `null`. This increases the memory footprint and introduces an indirection when accessing each of the `values`.
      
      This PR uses bitvectors to represent nullability for each element, like `UnsafeRow` does, and eliminates the indirection when accessing each element. The new ```UnsafeArrayData``` has four parts:
      ```
      [numElements][null bits][values or offset&length][variable length portion]
      ```
      In the `null bits` region, we store 1 bit per element, representing whether the element is null. Its total size is ceil(numElements / 8) bytes, and it is aligned to 8-byte boundaries.
      In the `values or offset&length` region, we store the content of the elements. For fields that hold fixed-length primitive types, such as long, double, or int, we store the value directly in the field. For fields with non-primitive or variable-length values, we store a relative offset (w.r.t. the base address of the array) that points to the beginning of the variable-length field, plus its length (the two are combined into a single long). Each is word-aligned. In the `variable length portion`, each value is aligned to 8-byte boundaries.
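
      For concreteness, a small helper (illustrative only) computing the size of the `null bits` region described above:

      ```scala
      // One bit per element, rounded up to whole bytes and aligned to 8 bytes.
      def nullBitsBytes(numElements: Int): Long = {
        val raw = (numElements + 7) / 8 // ceil(numElements / 8) bytes
        ((raw + 7) / 8) * 8L            // align to an 8-byte boundary
      }
      // nullBitsBytes(1024 * 1024) == 131072, i.e. 0.125M bytes for 1M elements
      ```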
      
      The new format can reduce the memory footprint and improve the performance of accessing each element. An example memory footprint comparison for a 1024x1024-element integer array:
      Size of ```baseObject``` for the current ```UnsafeArrayData```: 8 + 1024x1024 + 1024x1024 = 2M bytes
      Size of ```baseObject``` for the new ```UnsafeArrayData```: 8 + 1024x1024/8 + 1024x1024 = 1.125M bytes
      
      In summary, we got 1.0-2.6x performance improvements over the code before applying this PR.
      Here are performance results of [benchmark programs](https://github.com/kiszk/spark/blob/04d2e4b6dbdc4eff43ce18b3c9b776e0129257c7/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala):
      
      **Read UnsafeArrayData**: 1.7x and 1.6x performance improvements over the code before applying this PR
      ````
      OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
      Intel Xeon E3-12xx v2 (Ivy Bridge)
      
      Without SPARK-15962
      Read UnsafeArrayData:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            430 /  436        390.0           2.6       1.0X
      Double                                         456 /  485        367.8           2.7       0.9X
      
      With SPARK-15962
      Read UnsafeArrayData:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            252 /  260        666.1           1.5       1.0X
      Double                                         281 /  292        597.7           1.7       0.9X
      ````
      **Write UnsafeArrayData**: 1.0x and 1.1x performance improvements over the code before applying this PR
      ````
      OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
      Intel Xeon E3-12xx v2 (Ivy Bridge)
      
      Without SPARK-15962
      Write UnsafeArrayData:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            203 /  273        103.4           9.7       1.0X
      Double                                         239 /  356         87.9          11.4       0.8X
      
      With SPARK-15962
      Write UnsafeArrayData:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            196 /  249        107.0           9.3       1.0X
      Double                                         227 /  367         92.3          10.8       0.9X
      ````
      
      **Get primitive array from UnsafeArrayData**: 2.6x and 1.6x performance improvements over the code before applying this PR
      ````
      OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
      Intel Xeon E3-12xx v2 (Ivy Bridge)
      
      Without SPARK-15962
      Get primitive array from UnsafeArrayData: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            207 /  217        304.2           3.3       1.0X
      Double                                         257 /  363        245.2           4.1       0.8X
      
      With SPARK-15962
      Get primitive array from UnsafeArrayData: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            151 /  198        415.8           2.4       1.0X
      Double                                         214 /  394        293.6           3.4       0.7X
      ````
      
      **Create UnsafeArrayData from primitive array**: 1.7x and 2.1x performance improvements over the code before applying this PR
      ````
      OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
      Intel Xeon E3-12xx v2 (Ivy Bridge)
      
      Without SPARK-15962
      Create UnsafeArrayData from primitive array: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            340 /  385        185.1           5.4       1.0X
      Double                                         479 /  705        131.3           7.6       0.7X
      
      With SPARK-15962
      Create UnsafeArrayData from primitive array: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            206 /  211        306.0           3.3       1.0X
      Double                                         232 /  406        271.6           3.7       0.9X
      ````
      
      1.7x and 1.4x performance improvements in [```UDTSerializationBenchmark```](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala)  over the code before applying this PR
      ````
      OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
      Intel Xeon E3-12xx v2 (Ivy Bridge)
      
      Without SPARK-15962
      VectorUDT de/serialization:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      serialize                                      442 /  533          0.0      441927.1       1.0X
      deserialize                                    217 /  274          0.0      217087.6       2.0X
      
      With SPARK-15962
      VectorUDT de/serialization:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      serialize                                      265 /  318          0.0      265138.5       1.0X
      deserialize                                    155 /  197          0.0      154611.4       1.7X
      ````
      
      ## How was this patch tested?
      
      Added unit tests into ```UnsafeArraySuite```
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #13680 from kiszk/SPARK-15962.
  13. Sep 26, 2016
    • [SPARK-16356][ML] Add testImplicits for ML unit tests and promote toDF() · f234b7cd
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This was suggested in https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968.
      
      This PR adds `testImplicits` to `MLlibTestSparkContext` so that implicits such as `toDF()` can be used across ml tests.
      
      This PR also changes all the usages of `spark.createDataFrame( ... )` to `toDF()` where applicable in ml tests in Scala.
      
      ## How was this patch tested?
      
      Existing tests should work.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14035 from HyukjinKwon/minor-ml-test.
    • [SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API. · ac65139b
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #14597 modified ```ChiSqSelector``` to support the ```fpr``` selector type; however, it left some issues that need to be addressed:
      * We should allow users to set the selector type explicitly rather than switching between types via different setter functions, since the setting order leads to unexpected behavior. For example, if a user sets both ```numTopFeatures``` and ```percentile```, whether a ```kbest``` or ```percentile``` model is trained depends on the order of the calls (the one set last wins). This confuses users, so we should allow them to set the selector type explicitly. We handle similar issues elsewhere in the ML code base, such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
      * Meanwhile, if more than one parameter besides ```alpha``` can be set for the ```fpr``` model, the existing framework cannot handle it elegantly, and similar issues exist for the ```kbest``` and ```percentile``` models. Setting the selector type explicitly solves this as well.
      * If users are allowed to set the selector type explicitly, we should handle parameter interactions: for example, if a user sets ```selectorType = percentile``` and ```alpha = 0.1```, we should notify them that the parameter ```alpha``` will have no effect. We should handle complex parameter interaction checks in ```transformSchema```. (FYI #11620)
      * We should use lower case for the selector type names to follow the MLlib convention.
      * Add an ML Python API.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15214 from yanboliang/spark-17017.
  14. Sep 24, 2016
  15. Sep 23, 2016
    • [SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier · f89808b0
      WeichenXu authored
      
      ## What changes were proposed in this pull request?
      
      Update `MultilayerPerceptronClassifierWrapper.fit` parameter types:
      `layers: Array[Int]`
      `seed: String`
      
      Update several default params in sparkR `spark.mlp`:
      `tol` --> 1e-6
      `stepSize` --> 0.03
      `seed` --> NULL (when seed is NULL, the Scala-side wrapper regards it as a `null` value and the default seed is used)
      The R-side `seed` only supports 32-bit integers.
      
      Remove the `layers` default value and move it in front of the parameters that have default values.
      Add a `layers` parameter validation check.
      
      ## How was this patch tested?
      
      tests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15051 from WeichenXu123/update_py_mlp_default.
    • [SPARK-16719][ML] Random Forests should communicate fewer trees on each iteration · 947b8c6e
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      RandomForest currently sends the entire forest to each worker on each iteration. This is because (a) the node queue is FIFO and (b) the closure references the entire array of trees (topNodes). (a) causes RFs to handle splits in many trees, especially early on in learning. (b) sends all trees explicitly.
      
      This PR:
      (a) changes the RF node queue to FILO (a stack), so that RFs tend to focus on one or a few trees before moving on to others (see the sketch below);
      (b) changes topNodes to pass only the trees required on that iteration.
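
      A toy illustration of the queue-discipline change (`NodeRef` is hypothetical):

      ```scala
      import scala.collection.mutable

      case class NodeRef(treeIndex: Int, depth: Int)

      val stack = mutable.Stack[NodeRef]()
      Seq(0, 1, 2).foreach(t => stack.push(NodeRef(t, depth = 0))) // roots of 3 trees
      val first = stack.pop()                                      // tree 2's root
      // Splitting a node pushes its children; with LIFO they are handled next,
      // so the trainer keeps working on tree 2 instead of cycling through all trees.
      stack.push(NodeRef(first.treeIndex, first.depth + 1))
      stack.push(NodeRef(first.treeIndex, first.depth + 1))
      println(stack.pop()) // NodeRef(2,1) -- still the same tree
      ```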
      
      ## How was this patch tested?
      
      Unit tests:
      * Existing tests for correctness of tree learning
      * Manually modifying code and running tests to verify that a small number of trees are communicated on each iteration
        * This last item is hard to test via unit tests given the current APIs.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14359 from jkbradley/rfs-fewer-trees.
  16. Sep 22, 2016
    • [SPARK-16240][ML] ML persistence backward compatibility for LDA · f4f6bd8c
      Gayathri Murali authored
      ## What changes were proposed in this pull request?
      
      Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6.
      
      ## How was this patch tested?
      
      I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x).  In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15034 from jkbradley/lda-backwards.
    • [SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for AFTSurvivalRegression · 72d9fba2
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.
  17. Sep 21, 2016
    • [SPARK-11918][ML] Better error from WLS for cases like singular input · b4a4421b
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Update error handling for Cholesky decomposition to provide a little more info when input is singular.
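
      A sketch of the kind of check involved (the helper name and exact wording are assumptions):

      ```scala
      // LAPACK's dppsv reports failure only through an info code; translate a
      // positive code into an actionable message about (near-)singular input.
      def checkLapackInfo(info: Int, method: String): Unit = {
        if (info < 0) throw new IllegalArgumentException(
          s"LAPACK.$method: argument ${-info} had an illegal value")
        if (info > 0) throw new IllegalArgumentException(
          s"LAPACK.$method returned $info because the matrix is not positive definite; " +
            "the input may be singular (e.g. collinear columns)")
      }
      ```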
      
      ## How was this patch tested?
      
      New test case; jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15177 from srowen/SPARK-11918.
    • [SPARK-17219][ML] Add NaN value handling in Bucketizer · 57dc326b
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN values.
      Sometimes NaN values are meaningful to users, so in these cases Bucketizer should
      reserve one extra bucket for NaN values instead of throwing an exception (a simplified sketch follows the examples below).
      Before:
      ```
      Bucketizer.transform on NaN value threw an illegal exception.
      ```
      After:
      ```
      NaN values will be grouped in an extra bucket.
      ```
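
      A simplified sketch of the new behavior:

      ```scala
      // Values are binned by the splits; NaN goes to one extra bucket past the
      // regular ones instead of raising an exception.
      def bucketize(splits: Array[Double])(v: Double): Double = {
        if (v.isNaN) {
          (splits.length - 1).toDouble // the extra NaN bucket
        } else {
          var b = 0
          while (b < splits.length - 2 && v >= splits(b + 1)) b += 1
          b.toDouble
        }
      }
      // With splits [0, 10, 20]: 5.0 -> bucket 0, 15.0 -> bucket 1, NaN -> bucket 2
      ```
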
      ## How was this patch tested?
      New test cases added in `BucketizerSuite`.
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #14858 from VinceShieh/spark-17219.
    • [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test · b366f184
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      
      Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. This PR adds a chi-square selector based on the False Positive Rate (FPR) test, as implemented in scikit-learn:
      http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
      
      ## How was this patch tested?
      
      Added Scala unit tests.
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14597 from mpjlu/fprChiSquare.
    • [SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in Word2VecModel · 7654385f
      William Benton authored
      ## What changes were proposed in this pull request?
      
      The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.
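
      A sketch of the bounded-priority-queue idea (illustrative, not the exact implementation):

      ```scala
      import scala.collection.mutable

      // Stream once over the vocabulary, keeping only the best `num`
      // (word, similarity) pairs in a size-capped min-heap.
      def topSynonyms(scored: Iterator[(String, Double)], num: Int): Array[(String, Double)] = {
        // Reversed ordering: the queue head is the *worst* retained candidate.
        val heap = mutable.PriorityQueue.empty(Ordering.by[(String, Double), Double](-_._2))
        scored.foreach { ws =>
          if (heap.size < num) heap.enqueue(ws)
          else if (ws._2 > heap.head._2) { heap.dequeue(); heap.enqueue(ws) }
        }
        heap.dequeueAll.reverse.toArray // best match first
      }
      ```

      This makes a single O(V log num) pass instead of sorting all V similarities.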
      
      ## How was this patch tested?
      
      This patch adds no user-visible functionality and its correctness should be exercised by existing tests.  To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`:
      
      ```scala
      object W2VTiming {
        import org.apache.spark.{SparkContext, SparkConf}
        import org.apache.spark.mllib.feature.Word2VecModel
        def run(modelPath: String, scOpt: Option[SparkContext] = None) {
          val sc = scOpt.getOrElse(new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
          val model = Word2VecModel.load(sc, modelPath)
          val keys = model.getVectors.keys
          val start = System.currentTimeMillis
          for(key <- keys) {
            model.findSynonyms(key, 5)
            model.findSynonyms(key, 10)
            model.findSynonyms(key, 25)
            model.findSynonyms(key, 50)
          }
          val finish = System.currentTimeMillis
          println("run completed in " + (finish - start) + "ms")
        }
      }
      ```
      
      I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach.  (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.)
      
      Author: William Benton <willb@redhat.com>
      
      Closes #15150 from willb/SPARK-17595.
  18. Sep 19, 2016
    • [SPARK-17163][ML] Unified LogisticRegression interface · 26145a5a
      sethah authored
      ## What changes were proposed in this pull request?
      
      Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`.
      
      Marked as WIP because we should discuss the coefficients API in the model. See discussion below.
      
      JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)
      
      ## How was this patch tested?
      
      Merged test suites and added some new unit tests.
      
      ## Design
      
      ### Switching between binomial and multinomial
      
      We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.
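
      An illustrative sketch of that resolution logic:

      ```scala
      // How `family` is resolved against the detected number of classes.
      def resolveFamily(family: String, numClasses: Int): String =
        family.toLowerCase match {
          case "auto" => if (numClasses <= 2) "binomial" else "multinomial"
          case "binomial" if numClasses > 2 =>
            throw new IllegalArgumentException(
              s"Binomial family only supports 1 or 2 outcome classes but found $numClasses.")
          case other => other
        }
      ```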
      
      ### coefficients/intercept model API (TODO)
      
      This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:
      
      * We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double`
      * There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts.
      
      Some options:
      
      1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time.
      
      2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.
      
      If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.
      
      **Update:** We have chosen option 2
      
      ### Threshold/thresholds (TODO)
      
      Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.
      
      **Update:** Let's leave it for a follow up PR
      
      ## Follow up
      
      * Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
      * Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #14834 from sethah/SPARK-17163.
  19. Sep 17, 2016
    • [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector · 25cbbe6c
      William Benton authored
      
      ## What changes were proposed in this pull request?
      
      This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary.  Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.
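
      A sketch of the changed filtering (the tolerance is an assumption):

      ```scala
      // Drop a candidate only if it is the query word itself, or if its cosine
      // similarity implies an identical direction to the query vector, rather
      // than always discarding the top-ranked match.
      def filterSynonyms(ranked: Seq[(String, Double)],
                         queryWord: Option[String],
                         num: Int): Seq[(String, Double)] =
        ranked.filterNot { case (word, sim) =>
          queryWord.contains(word) || sim >= 1.0 - 1e-10
        }.take(num)
      ```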
      
      ## How was this patch tested?
      
      I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.
      
      Author: William Benton <willb@redhat.com>
      
      Closes #15105 from willb/fix/findSynonyms.
  20. Sep 15, 2016
    • [SPARK-17507][ML][MLLIB] check weight vector size in ANN · d15b4f90
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      As the TODO described, check the weight vector size and throw an exception if it is wrong.
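
      A sketch of the check (assuming a fully-connected MLP with biases):

      ```scala
      // The expected weight vector length follows from the layer sizes alone:
      // for each consecutive pair of layers, weights plus biases.
      def checkWeightsSize(weights: Array[Double], layers: Array[Int]): Unit = {
        val expected = (0 until layers.length - 1)
          .map(i => layers(i) * layers(i + 1) + layers(i + 1))
          .sum
        require(weights.length == expected,
          s"Expected weight vector of size $expected but got ${weights.length}.")
      }
      ```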
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15060 from WeichenXu123/check_input_weight_size_of_ann.
  21. Sep 11, 2016