  1. Mar 09, 2016
• [SPARK-11861][ML] Add feature importances for decision trees · e1772d3f
      sethah authored
      This patch adds an API entry point for single decision tree feature importances.
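For context, a minimal usage sketch of the new entry point (the data, column names, and use of `sqlContext` are illustrative assumptions, not part of the patch):

```scala
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.mllib.linalg.Vectors

// Tiny illustrative training set.
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.0)),
  (2.0, Vectors.dense(1.0, 0.0)),
  (3.0, Vectors.dense(1.0, 1.0))
)).toDF("label", "features")

val model = new DecisionTreeRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(training)

// The new entry point: a Vector with one importance value per feature.
println(model.featureImportances)
```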
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9912 from sethah/SPARK-11861.
      e1772d3f
• [SPARK-13615][ML] GeneralizedLinearRegression supports save/load · 0dd06485
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```GeneralizedLinearRegression``` supports ```save/load```.
      cc mengxr
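As a hedged sketch of the save/load round trip this adds (data, path, and `sqlContext` usage are assumptions for illustration):

```scala
import org.apache.spark.ml.regression.{GeneralizedLinearRegression, GeneralizedLinearRegressionModel}
import org.apache.spark.mllib.linalg.Vectors

// Small illustrative training set.
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.0)),
  (2.0, Vectors.dense(1.0, 0.0)),
  (3.0, Vectors.dense(1.0, 1.0)),
  (4.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")

val model = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .fit(training)

// The round trip added here: persist the fitted model and load it back.
model.write.overwrite().save("/tmp/glr-model")
val restored = GeneralizedLinearRegressionModel.load("/tmp/glr-model")
```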
      ## How was this patch tested?
      unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11465 from yanboliang/spark-13615.
      0dd06485
• [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. · c3689bc2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
In order to make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces existing code like the following with the `diamond` operator.
      
      ```
      -    final ArrayList<Product2<Object, Object>> dataToWrite =
      -      new ArrayList<Product2<Object, Object>>();
      +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
      ```
      
Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark's Java code uses it inconsistently.
      
      ## How was this patch tested?
      
Manual.
Passes the existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11541 from dongjoon-hyun/SPARK-13702.
      c3689bc2
  2. Mar 08, 2016
• [ML] testEstimatorAndModelReadWrite should call checkModelData · 9740954f
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Although we define ```checkModelData``` in the [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` never calls ```checkModelData``` to check the equality of the model data. As a result, the model data equality check is currently not run for any of these test cases, and this should be fixed.
This patch also fixes a bug in the LDA read/write test, which did not set ```docConcentration```. That bug should have caused a test failure, but it went unnoticed because ```checkModelData``` was never actually run.
      cc jkbradley mengxr
      ## How was this patch tested?
No new unit tests; it should pass the existing ones.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11513 from yanboliang/ml-check-model-data.
      9740954f
• [SPARK-13715][MLLIB] Remove last usages of jblas in tests · 54040f8d
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
Remove the last usages of jblas in tests.
      
      ## How was this patch tested?
      
      Jenkins tests -- the same ones that are being modified.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11560 from srowen/SPARK-13715.
      54040f8d
  3. Mar 07, 2016
• [SPARK-13665][SQL] Separate the concerns of HadoopFsRelation · e720dda4
      Michael Armbrust authored
`HadoopFsRelation` is used for reading most files into Spark SQL.  However, today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.  As a result, many data sources are forced to reimplement the same functionality, and the various layers have accumulated a fair bit of inefficiency.  This PR is a first cut at separating this into several components / interfaces that are each described below.  Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, libsvm) have been ported to the new API `FileFormat`.  External libraries, such as spark-avro, will also need to be ported to work with Spark 2.0.
      
      ### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource.  All discovery, resolution and merging logic for schemas and partitions has been removed.  This is an internal representation that no longer needs to be exposed to developers.
      
      ```scala
      case class HadoopFsRelation(
          sqlContext: SQLContext,
          location: FileCatalog,
          partitionSchema: StructType,
          dataSchema: StructType,
          bucketSpec: Option[BucketSpec],
          fileFormat: FileFormat,
          options: Map[String, String]) extends BaseRelation
      ```
      
      ### FileFormat
      The primary interface that will be implemented by each different format including external libraries.  Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`.  A format can optionally return a schema that is inferred from a set of files.
      
      ```scala
      trait FileFormat {
        def inferSchema(
            sqlContext: SQLContext,
            options: Map[String, String],
            files: Seq[FileStatus]): Option[StructType]
      
        def prepareWrite(
            sqlContext: SQLContext,
            job: Job,
            options: Map[String, String],
            dataSchema: StructType): OutputWriterFactory
      
        def buildInternalScan(
            sqlContext: SQLContext,
            dataSchema: StructType,
            requiredColumns: Array[String],
            filters: Array[Filter],
            bucketSet: Option[BitSet],
            inputFiles: Array[FileStatus],
            broadcastedConf: Broadcast[SerializableConfiguration],
            options: Map[String, String]): RDD[InternalRow]
      }
      ```
      
      The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner).  Additionally, scans are still returning `RDD`s instead of iterators for single files.  In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
      
      ### FileCatalog
      This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
      
      ```scala
      trait FileCatalog {
        def paths: Seq[Path]
        def partitionSpec(schema: Option[StructType]): PartitionSpec
        def allFiles(): Seq[FileStatus]
        def getStatus(path: Path): Array[FileStatus]
        def refresh(): Unit
      }
      ```
      
      Currently there are two implementations:
       - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`.  Infers partitioning by recursive listing and caches this data for performance
       - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
      
      ### ResolvedDataSource
      Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
       - `paths: Seq[String] = Nil`
       - `userSpecifiedSchema: Option[StructType] = None`
       - `partitionColumns: Array[String] = Array.empty`
       - `bucketSpec: Option[BucketSpec] = None`
       - `provider: String`
       - `options: Map[String, String]`
      
This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file-based ones).  All reconciliation of partitions, buckets, and schemas (whether from a metastore or inferred) is done here.
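As a hedged illustration (not part of this PR), a typical `DataFrameReader` call supplies most of these inputs: `provider` comes from `format`, `paths` from `load`, and `options` from `option` calls, with partitions, buckets, and schema reconciled here. The path and option below are assumptions.

```scala
val df = sqlContext.read
  .format("parquet")                 // provider
  .option("mergeSchema", "true")     // options
  .load("/data/events")              // paths
```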
      
      ### DataSourceAnalysis / DataSourceStrategy
      Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
       - pruning the files from partitions that will be read based on filters.
       - appending partition columns*
- applying additional filters when a data source cannot evaluate them internally.
       - constructing an RDD that is bucketed correctly when required*
       - sanity checking schema match-up and other analysis when writing.
      
*In the future we should do the following:
 - Break out file handling into its own Strategy, as it is sufficiently complex / isolated.
 - Push the appending of partition columns down into `FileFormat` to avoid an extra copy / unvectorization.
       - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11509 from marmbrus/fileDataSource.
      e720dda4
  4. Mar 04, 2016
  5. Mar 03, 2016
• [MINOR] Fix typos in comments and testcase name of code · 941b270b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes typos in comments and testcase name of code.
      
      ## How was this patch tested?
      
      manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
      941b270b
• [MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam · ce58e99a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
      cc mengxr srowen
      ## How was this patch tested?
      Documents change, no test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11344 from yanboliang/shared-cleanup.
      ce58e99a
• [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule · b5f02d67
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims to remove unused imports from Java/Scala code and to add an `UnusedImports` checkstyle rule to help developers.
      
      ## How was this patch tested?
      ```
      ./dev/lint-java
      ./build/sbt compile
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11438 from dongjoon-hyun/SPARK-13583.
      b5f02d67
• [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x · e97fc7f1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
      
      - Inner class should be static
      - Mismatched hashCode/equals
      - Overflow in compareTo
      - Unchecked warnings
      - Misuse of assert, vs junit.assert
      - get(a) + getOrElse(b) -> getOrElse(a,b)
      - Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
      - Dead code
      - tailrec
- exists(_ == ) -> contains
- find + nonEmpty -> exists
- filter + size -> count
- reduce(_+_) -> sum
- map + flatten -> map
      
The most controversial change may be .size -> .length, simply because of the number of call sites it touches. It is intended to avoid implicit conversions that might be expensive in some places.
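As a hedged illustration (these snippets are examples of the patterns, not lines from the diff):

```scala
val xs = Seq(1, 2, 3)

// exists(_ == x) -> contains(x)
val hasTwoOld = xs.exists(_ == 2)
val hasTwoNew = xs.contains(2)

// Array/String .size -> .length avoids an implicit conversion to a wrapper class.
val arr = Array(1, 2, 3)
val n = arr.length

// reduce(_ + _) -> sum
val totalOld = xs.reduce(_ + _)
val totalNew = xs.sum
```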
      
## How was this patch tested?
      
      Existing Jenkins unit tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11292 from srowen/SPARK-13423.
      e97fc7f1
  6. Mar 01, 2016
  7. Feb 29, 2016
• [SPARK-13506][MLLIB] Fix the wrong parameter in R code comment in AssociationRulesSuite · ac5c6352
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13506
      
      ## What changes were proposed in this pull request?
      
Just change the R snippet comment in AssociationRulesSuite.
      
      ## How was this patch tested?
      
Unit tests passed.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11387 from zhengruifeng/ars.
      ac5c6352
• [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default... · d81a7135
      Yanbo Liang authored
      [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python
      
      ## What changes were proposed in this pull request?
* The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala, which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
* Additionally, if we use a known updater (L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` has no effect when we take that route (see the sketch after this list).
* Made a pass over all parameters of ```LogisticRegressionWithLBFGS```; the others are set properly.
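For reference, a hedged sketch of the Scala side being aligned with (the explicit values are only illustrative):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val lr = new LogisticRegressionWithLBFGS()
lr.optimizer
  .setRegParam(0.0)       // the Scala default; PySpark should match it
  .setNumCorrections(10)  // has no effect when the call is routed to the ML implementation
```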
      
      cc mengxr dbtsai
      ## How was this patch tested?
No new tests; it should pass all current tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11424 from yanboliang/spark-13545.
      d81a7135
  8. Feb 26, 2016
• [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format · b33261f9
      Bryan Cutler authored
      Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the tree module.
      
      closes #10601
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: vijaykiran <mail@vijaykiran.com>
      
      Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.
      b33261f9
• [SPARK-13457][SQL] Removes DataFrame RDD operations · 99dfcedb
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This is another try of PR #11323.
      
      This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.
      
PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap the underlying RDD operations with `withNewExecutionId` to track Spark jobs, but those wrappers were removed in #11323.
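A hedged illustration of the migration (example code, not taken from the patch):

```scala
val df = sqlContext.range(0, 10).toDF("id")

// Before: df.map(_.getLong(0) + 1) ran the RDD operation directly on the DataFrame.
// After: go through DataFrame.rdd explicitly.
val plusOne = df.rdd.map(_.getLong(0) + 1)

// foreach / foreachPartition stay on DataFrame and keep the withNewExecutionId wrapping.
df.foreach(row => ())
```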
      
## How was this patch tested?
      
      No extra tests are added. Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11388 from liancheng/remove-df-rdd-ops.
      99dfcedb
  9. Feb 25, 2016
• [SPARK-13028] [ML] Add MaxAbsScaler to ML.feature as a transformer · 90d07154
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-13028
MaxAbsScaler works very similarly to MinMaxScaler, but scales the data so that the training values lie within the range [-1, 1], by dividing each feature by its largest maximum absolute value. The motivations for this scaling include robustness to very small standard deviations of features and preservation of zero entries in sparse data.
      
      Unlike StandardScaler and MinMaxScaler, MaxAbsScaler does not shift/center the data, and thus does not destroy any sparsity.
      
      Something similar from sklearn:
      http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler
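A hedged usage sketch (data, `sqlContext`, and the pre-2.0 `mllib.linalg` import are illustrative assumptions):

```scala
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.mllib.linalg.Vectors

// Each feature is divided by its maximum absolute value (4.0 and 8.0 here),
// so zero entries stay zero and every value lands in [-1, 1].
val df = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(1.0, -8.0)),
  (1, Vectors.dense(2.0, 4.0)),
  (2, Vectors.dense(-4.0, 2.0))
)).toDF("id", "features")

val scaler = new MaxAbsScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

scaler.fit(df).transform(df).show()
// e.g. [1.0, -8.0] -> [0.25, -1.0]
```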
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #10939 from hhbyyh/maxabs and squashes the following commits:
      
      fd8bdcd [Yuhao Yang] add tag and some optimization on fit
      648fced [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      75bebc2 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      cb10bb6 [Yuhao Yang] remove minmax
      91ef8f3 [Yuhao Yang] ut added
      8ab0747 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      a9215b5 [Yuhao Yang] max abs scaler
      90d07154
• [SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication · 14e2700d
      Yu ISHIKAWA authored
      ## What changes were proposed in this pull request?
      ML StringIndexer does not protect itself from column name duplication.
      
We should still improve the way we validate the schema of `StringIndexer` and `StringIndexerModel`.  However, it would be better to address that in a separate issue.
      
      ## How was this patch tested?
      unit test
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #11370 from yu-iskw/SPARK-12874.
      14e2700d
• Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations" · 751724b1
      Davies Liu authored
      This reverts commit 157fe64f.
      751724b1
• [SPARK-13457][SQL] Removes DataFrame RDD operations · 157fe64f
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.
      
## How was this patch tested?
      
      No extra tests are added. Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11323 from liancheng/remove-df-rdd-ops.
      157fe64f
• [SPARK-13490][ML] ML LinearRegression should cache standardization param value · 4460113d
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of ```standardization``` rather than re-fetching it from the ```ParamMap``` on every OWLQN iteration.
      cc srowen
      
      ## How was this patch tested?
      No extra tests are added. It should pass all existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11367 from yanboliang/spark-13490.
      4460113d
• [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames · 6f8e835c
      Oliver Pierson authored
      ## What changes were proposed in this pull request?
      
      Change line 113 of QuantileDiscretizer.scala to
      
      `val requiredSamples = math.max(numBins * numBins, 10000.0)`
      
so that `requiredSamples` is a `Double`.  This fixes the division in line 114, which currently results in zero whenever `requiredSamples < dataset.count`.
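A hedged illustration of the arithmetic with example values (this is not the actual QuantileDiscretizer code):

```scala
val numBins = 10
val datasetCount = 1000000L

// Before: Int math, and Int / Long truncates to zero once the dataset count exceeds requiredSamples.
val requiredSamplesOld = math.max(numBins * numBins, 10000)   // Int: 10000
val fractionOld = requiredSamplesOld / datasetCount           // 0

// After: Double math keeps the sampling fraction meaningful.
val requiredSamples = math.max(numBins * numBins, 10000.0)    // Double: 10000.0
val fraction = requiredSamples / datasetCount                 // 0.01
```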
      
## How was this patch tested?
Manual tests.  I was having problems using QuantileDiscretizer with a dataset, and after making this change QuantileDiscretizer behaves as expected.
      
      Author: Oliver Pierson <ocp@gatech.edu>
      Author: Oliver Pierson <opierson@umd.edu>
      
      Closes #11319 from oliverpierson/SPARK-13444.
      6f8e835c
  10. Feb 23, 2016
  11. Feb 22, 2016
• [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements -... · 33ef3aa7
      Narine Kokhlikyan authored
      [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements - avoid creating new instances of arrays/vectors for each record
      
As also noted by a TODO in the AFTAggregator.add(data: AFTPoint) method, a new array is created for the intercept value and concatenated with another array containing the betas; the resulting array is converted into a dense vector, which in turn is converted into a Breeze vector.
This is expensive and not particularly elegant.
      
I've tried to solve the problem described above with a simple algebraic decomposition: keeping and treating the intercept independently.
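A hedged sketch of the idea (simplified; not the actual AFTAggregator code):

```scala
import breeze.linalg.{DenseVector => BDV}

// Before (conceptually): Array(intercept) ++ beta was concatenated, wrapped in a dense vector,
// and converted to a Breeze vector for every record. Keeping the intercept separate avoids that:
def margin(beta: BDV[Double], intercept: Double, features: BDV[Double]): Double =
  (beta dot features) + intercept

val beta = BDV(0.5, -0.25)
val features = BDV(2.0, 4.0)
println(margin(beta, intercept = 0.1, features = features))  // 0.1 + (1.0 - 1.0) = 0.1
```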
      
Please let me know what you think and whether you have any questions.
      
      Thanks,
      Narine
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #11179 from NarineK/survivaloptim.
      33ef3aa7
• [SPARK-13334][ML] ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should set parent · 40e6d40f
      Yanbo Liang authored
      ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11214 from yanboliang/spark-13334.
      40e6d40f
• [SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format · e298ac91
      Bryan Cutler authored
      Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the fpm and recommendation modules.
      
      Closes #10602
      Closes #10897
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: somideshmukh <somilde@us.ibm.com>
      
      Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
      e298ac91
• [MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments · 024482bf
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR tries to fix all typos in all markdown files under the `docs` module,
      and fixes similar typos in other comments, too.
      
## How was this patch tested?
      
      manual tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11300 from dongjoon-hyun/minor_fix_typos.
      024482bf
• [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and... · ef1047fc
      Yong Gang Cao authored
      [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and other tuning for Word2Vec
      
Add support for arbitrary-length sentences by using the natural representation of sentences in the input.
      
Add new similarity functions and a normalization option for distances in synonym finding.
Add new accessors for internal structures (the vocabulary and word index) for convenience.
      
Instructions are needed on how to set the value of the Since annotation for the newly added public functions. 1.5.3?
      
      jira link: https://issues.apache.org/jira/browse/SPARK-12153
      
      Author: Yong Gang Cao <ygcao@amazon.com>
      Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
      
      Closes #10152 from ygcao/improvementForSentenceBoundary.
      ef1047fc
  12. Feb 21, 2016
  13. Feb 17, 2016
  14. Feb 16, 2016
  15. Feb 15, 2016
• [SPARK-13097][ML] Binarizer allowing Double AND Vector input types · cbeb006f
      seddonm1 authored
      This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type.
      
A use case for this enhancement is when a user wants to binarize many similar feature columns at once using the same threshold value (for example, a binary threshold applied to many pixels in an image).
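A hedged usage sketch of the Vector input path (data, `sqlContext`, and the pre-2.0 `mllib.linalg` import are illustrative assumptions):

```scala
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.mllib.linalg.Vectors

val df = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(0.1, 0.8, 0.2)),
  (1, Vectors.dense(0.9, 0.3, 0.5))
)).toDF("id", "features")

// The same threshold is applied element-wise across the whole Vector column.
val binarizer = new Binarizer()
  .setInputCol("features")
  .setOutputCol("binarized_features")
  .setThreshold(0.5)

binarizer.transform(df).show()
// Every element greater than 0.5 becomes 1.0, all others 0.0.
```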
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      viirya mengxr
      
      Author: seddonm1 <seddonm1@gmail.com>
      
      Closes #10976 from seddonm1/master.
      cbeb006f
  16. Feb 13, 2016
  17. Feb 11, 2016