  1. Mar 15, 2016
  2. Mar 14, 2016
    • [SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files · 17eec0a7
      Michael Armbrust authored
      This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.
      
      Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
       - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns  in the public API of `org.apache.spark.sql.sources.FileFormat`
       - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
       - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
       - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
       - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm.
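
      As an illustration of that last point, here is a minimal sketch of a greedy, approximate bin-packing of files into tasks (the names and the policy are assumptions for illustration; this is not the actual `FileSourceStrategy` code):

      ```scala
      // Greedily fill the current task with files until adding one more would exceed the target size.
      case class FileSlice(path: String, sizeInBytes: Long)

      def packIntoTasks(files: Seq[FileSlice], targetBytesPerTask: Long): Seq[Seq[FileSlice]] = {
        val tasks = scala.collection.mutable.ArrayBuffer.empty[Seq[FileSlice]]
        var current = Vector.empty[FileSlice]
        var currentBytes = 0L
        // Sorting by descending size lets large files anchor tasks and small files fill the gaps.
        files.sortBy(-_.sizeInBytes).foreach { file =>
          if (current.nonEmpty && currentBytes + file.sizeInBytes > targetBytesPerTask) {
            tasks += current
            current = Vector.empty
            currentBytes = 0L
          }
          current :+= file
          currentBytes += file.sizeInBytes
        }
        if (current.nonEmpty) tasks += current
        tasks.toSeq
      }
      ```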
      
      Currently only a testing source is planned / tested using this strategy.  In follow-up PRs we will port the existing formats to this API.
      
      A stub for `FileScanRDD` is also added, but most methods remain unimplemented.
      
      Other minor cleanups:
       - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic.  This will also allow future implementations to use indexes or other tricks (e.g. a MySQL metastore)
       - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
       - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
       - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #11646 from marmbrus/fileStrategy.
      17eec0a7
    • [SPARK-11826][MLLIB] Refactor add() and subtract() methods · 992142b8
      Ehsan M.Kermani authored
      srowen Could you please check this when you have time?
      
      Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
      
      Closes #9916 from ehsanmok/JIRA-11826.
      992142b8
    • [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `regParam` to... · a48296f4
      Dongjoon Hyun authored
      [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD
      
      ## What changes were proposed in this pull request?
      
      `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have `regParam` among their constructor arguments; they just rely on GradientDescent's default `regParam` value.
      To be consistent with other algorithms, we should add it, keeping the same default value.
      
      ## How was this patch tested?
      
      Pass the existing unit test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11527 from dongjoon-hyun/SPARK-13686.
      a48296f4
    • [MINOR][DOCS] Fix more typos in comments/strings. · acdf2197
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes 135 typos over 107 files:
      * 121 typos in comments
      * 11 typos in test case names
      * 3 typos in log messages
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11689 from dongjoon-hyun/fix_more_typos.
      acdf2197
  3. Mar 13, 2016
    • [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <->... · 18408528
      Sean Owen authored
      [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
      
      ## What changes were proposed in this pull request?
      
      - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
      - Same for `InputStreamReader` and `OutputStreamWriter` constructors
      - Standardizes on UTF-8 everywhere
       - Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`)
      - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c )
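
      For reference, a small sketch of the convention using the standard JDK APIs (illustrative only, not a line from the patch):

      ```scala
      import java.io.{ByteArrayInputStream, InputStreamReader}
      import java.nio.charset.StandardCharsets

      // Always name the charset instead of relying on the platform default.
      val bytes: Array[Byte] = "héllo".getBytes(StandardCharsets.UTF_8)   // not "héllo".getBytes()
      val text: String = new String(bytes, StandardCharsets.UTF_8)        // not new String(bytes)

      // Reader/Writer constructors take the charset explicitly as well.
      val reader = new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)
      ```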
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11657 from srowen/SPARK-13823.
      18408528
  4. Mar 12, 2016
  5. Mar 10, 2016
    • [SPARK-13244][SQL] Migrates DataFrame to Dataset · 1d542785
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`.
      
      Most Scala code changes are source compatible, but the Java API is broken since Java knows nothing about Scala type aliases (the fix is mostly replacing `DataFrame` with `Dataset<Row>`).
      
      There are several noticeable API changes related to those returning arrays:
      
      1.  `collect`/`take`
      
          -   Old APIs in class `DataFrame`:
      
              ```scala
              def collect(): Array[Row]
              def take(n: Int): Array[Row]
              ```
      
          -   New APIs in class `Dataset[T]`:
      
              ```scala
              def collect(): Array[T]
              def take(n: Int): Array[T]
      
              def collectRows(): Array[Row]
              def takeRows(n: Int): Array[Row]
              ```
      
          Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `Dataset.collect(): Array[T]` actually returns `Object` instead of `Array<T>` on the Java side.
      
          Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
      
      1.  `randomSplit`
      
          -   Old APIs in class `DataFrame`:
      
              ```scala
              def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
              def randomSplit(weights: Array[Double]): Array[DataFrame]
              ```
      
          -   New APIs in class `Dataset[T]`:
      
              ```scala
              def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
              def randomSplit(weights: Array[Double]): Array[Dataset[T]]
              ```
      
          The same problem as above, but it hasn't been addressed for the Java API yet.  We can probably add `randomSplitAsList` to fix this one.
      
      1.  `groupBy`
      
          Some original `DataFrame.groupBy` methods have conflicting signatures with the original `Dataset.groupBy` methods.  To distinguish the two, the typed `Dataset.groupBy` methods are renamed to `groupByKey`.
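
          A small usage sketch of the rename (a hedged example, assuming a `sqlContext` with its implicits in scope; the case class and data are made up, and exact return types were still evolving at this point):

          ```scala
          import sqlContext.implicits._

          case class Record(key: String, value: Long)
          val ds = Seq(Record("a", 1L), Record("b", 2L), Record("a", 3L)).toDS()

          val untyped = ds.groupBy("key").count()      // relational grouping keeps its original name
          val typed   = ds.groupByKey(_.key).count()   // typed grouping is now groupByKey
          ```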
      
      Other noticeable changes:
      
      1.  Dataset always does eager analysis now
      
          We used to support disabling DataFrame eager analysis to help report the partially analyzed, malformed logical plan on analysis failure.  However, Dataset encoders require eager analysis during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.
      
      ## How was this patch tested?
      
      Existing tests do the work.
      
      ## TODO
      
      - [ ] Fix all tests
      - [ ] Re-enable MiMA check
      - [ ] Update ScalaDoc (`since`, `group`, and example code)
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Cheng Lian <liancheng@users.noreply.github.com>
      
      Closes #11443 from liancheng/ds-to-df.
      1d542785
    • [SPARK-3854][BUILD] Scala style: require spaces before `{`. · 91fed8e9
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule that forbids the '){' pattern in the common case shown below, and fixes the code accordingly. Enforcing this in ScalaStyle from now on will improve Scala code quality and reduce review time.
      ```
      // Correct:
      if (true) {
        println("Wow!")
      }
      
      // Incorrect:
      if (true){
         println("Wow!")
      }
      ```
      IntelliJ also shows new warnings based on this.
      
      ## How was this patch tested?
      
      Pass the Jenkins ScalaStyle test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11637 from dongjoon-hyun/SPARK-3854.
      91fed8e9
    • [SPARK-11108][ML] OneHotEncoder should support other numeric types · 9fe38aba
      sethah authored
      Adding support for other numeric types:
      
      * Integer
      * Short
      * Long
      * Float
      * Decimal
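
      A usage sketch under the new behavior, assuming a `sqlContext` in scope and an integer index column that previously would have required an explicit cast to Double (column names and data are illustrative):

      ```scala
      import org.apache.spark.ml.feature.OneHotEncoder

      // categoryIndex is an IntegerType column; with this change it is accepted directly.
      val df = sqlContext.createDataFrame(Seq((0, 1), (1, 0), (2, 2))).toDF("id", "categoryIndex")

      val encoded = new OneHotEncoder()
        .setInputCol("categoryIndex")
        .setOutputCol("categoryVec")
        .transform(df)
      ```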
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9777 from sethah/SPARK-11108.
      9fe38aba
  6. Mar 09, 2016
    • [SPARK-11861][ML] Add feature importances for decision trees · e1772d3f
      sethah authored
      This patch adds an API entry point for single decision tree feature importances.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9912 from sethah/SPARK-11861.
      e1772d3f
    • [SPARK-13615][ML] GeneralizedLinearRegression supports save/load · 0dd06485
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```GeneralizedLinearRegression``` supports ```save/load```.
      cc mengxr
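
      A minimal usage sketch of the save/load round trip, assuming the standard ML persistence API and a training DataFrame already in scope (the path is arbitrary):

      ```scala
      import org.apache.spark.ml.regression.{GeneralizedLinearRegression, GeneralizedLinearRegressionModel}

      val glr = new GeneralizedLinearRegression().setFamily("gaussian").setLink("identity")
      val model = glr.fit(training)                                           // training: DataFrame of (label, features)

      model.write.overwrite().save("/tmp/glr-model")                          // save
      val restored = GeneralizedLinearRegressionModel.load("/tmp/glr-model")  // load
      ```
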
      ## How was this patch tested?
      unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11465 from yanboliang/spark-13615.
      0dd06485
    • [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. · c3689bc2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing code like the following with the `diamond` operator.
      
      ```
      -    final ArrayList<Product2<Object, Object>> dataToWrite =
      -      new ArrayList<Product2<Object, Object>>();
      +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
      ```
      
      Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (`<>`). Currently, Spark's Java code uses it inconsistently.
      
      ## How was this patch tested?
      
      Manual.
      Pass the existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11541 from dongjoon-hyun/SPARK-13702.
      c3689bc2
  7. Mar 08, 2016
    • [ML] testEstimatorAndModelReadWrite should call checkModelData · 9740954f
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Although we define ```checkModelData``` in the [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` never calls ```checkModelData``` to check the equality of model data. So we currently do not run the model data equality check for any test case; we should fix that.
      BTW, this also fixes a bug in the LDA read/write test, which did not set ```docConcentration```. That bug should have made the test fail, but it went unnoticed because ```checkModelData``` was never actually run.
      cc jkbradley mengxr
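
      A simplified sketch of the intended behaviour (hypothetical helper, not the actual MLlib ```testEstimatorAndModelReadWrite```): after the save/load round trip, the caller-supplied check must actually be invoked.

      ```scala
      // Hypothetical helper illustrating the fix; the real test utility's signature differs.
      def testModelReadWrite[M](model: M, save: M => String, load: String => M)
                               (checkModelData: (M, M) => Unit): Unit = {
        val path = save(model)          // write the model out
        val loaded = load(path)         // read it back
        checkModelData(model, loaded)   // the call that was previously missing
      }
      ```
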
      ## How was this patch tested?
      No new unit tests; it should pass the existing ones.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11513 from yanboliang/ml-check-model-data.
      9740954f
    • [SPARK-13715][MLLIB] Remove last usages of jblas in tests · 54040f8d
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove last usage of jblas, in tests
      
      ## How was this patch tested?
      
      Jenkins tests -- the same ones that are being modified.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11560 from srowen/SPARK-13715.
      54040f8d
  8. Mar 07, 2016
    • [SPARK-13665][SQL] Separate the concerns of HadoopFsRelation · e720dda4
      Michael Armbrust authored
      `HadoopFsRelation` is used for reading most files into Spark SQL.  However, today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.  As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency.  This PR is a first cut at separating this into several components / interfaces that are each described below.  Additionally, all implementations inside Spark (parquet, csv, json, text, orc, libsvm) have been ported to the new API `FileFormat`.  External libraries, such as spark-avro, will also need to be ported to work with Spark 2.0.
      
      ### HadoopFsRelation
      A simple `case class` that acts as a container for all of the metadata required to read from a datasource.  All discovery, resolution and merging logic for schemas and partitions has been removed.  This is an internal representation that no longer needs to be exposed to developers.
      
      ```scala
      case class HadoopFsRelation(
          sqlContext: SQLContext,
          location: FileCatalog,
          partitionSchema: StructType,
          dataSchema: StructType,
          bucketSpec: Option[BucketSpec],
          fileFormat: FileFormat,
          options: Map[String, String]) extends BaseRelation
      ```
      
      ### FileFormat
      The primary interface that will be implemented by each different format including external libraries.  Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`.  A format can optionally return a schema that is inferred from a set of files.
      
      ```scala
      trait FileFormat {
        def inferSchema(
            sqlContext: SQLContext,
            options: Map[String, String],
            files: Seq[FileStatus]): Option[StructType]
      
        def prepareWrite(
            sqlContext: SQLContext,
            job: Job,
            options: Map[String, String],
            dataSchema: StructType): OutputWriterFactory
      
        def buildInternalScan(
            sqlContext: SQLContext,
            dataSchema: StructType,
            requiredColumns: Array[String],
            filters: Array[Filter],
            bucketSet: Option[BitSet],
            inputFiles: Array[FileStatus],
            broadcastedConf: Broadcast[SerializableConfiguration],
            options: Map[String, String]): RDD[InternalRow]
      }
      ```
      
      The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner).  Additionally, scans are still returning `RDD`s instead of iterators for single files.  In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
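
      For orientation, a heavily hedged stub of the trait above: the import paths for the internal types are assumptions, and several of them are `private[spark]`, so a stub like this would only compile inside the Spark source tree (or, suitably adapted, in an external library such as spark-avro).

      ```scala
      import org.apache.hadoop.fs.FileStatus
      import org.apache.hadoop.mapreduce.Job
      import org.apache.spark.broadcast.Broadcast
      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.sources.{FileFormat, Filter, OutputWriterFactory}
      import org.apache.spark.sql.types.StructType
      import org.apache.spark.util.SerializableConfiguration
      import org.apache.spark.util.collection.BitSet

      class NoopFileFormat extends FileFormat {
        // This toy format cannot infer a schema; callers must supply one.
        override def inferSchema(
            sqlContext: SQLContext,
            options: Map[String, String],
            files: Seq[FileStatus]): Option[StructType] = None

        // The write path is left unimplemented in this sketch.
        override def prepareWrite(
            sqlContext: SQLContext,
            job: Job,
            options: Map[String, String],
            dataSchema: StructType): OutputWriterFactory =
          throw new UnsupportedOperationException("write path not implemented in this sketch")

        // A real format would open each file and emit InternalRows; this sketch returns an empty RDD.
        override def buildInternalScan(
            sqlContext: SQLContext,
            dataSchema: StructType,
            requiredColumns: Array[String],
            filters: Array[Filter],
            bucketSet: Option[BitSet],
            inputFiles: Array[FileStatus],
            broadcastedConf: Broadcast[SerializableConfiguration],
            options: Map[String, String]): RDD[InternalRow] =
          sqlContext.sparkContext.emptyRDD[InternalRow]
      }
      ```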
      
      ### FileCatalog
      This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
      
      ```scala
      trait FileCatalog {
        def paths: Seq[Path]
        def partitionSpec(schema: Option[StructType]): PartitionSpec
        def allFiles(): Seq[FileStatus]
        def getStatus(path: Path): Array[FileStatus]
        def refresh(): Unit
      }
      ```
      
      Currently there are two implementations:
       - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`.  Infers partitioning by recursive listing and caches this data for performance
       - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
      
      ### ResolvedDataSource
      Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
       - `paths: Seq[String] = Nil`
       - `userSpecifiedSchema: Option[StructType] = None`
       - `partitionColumns: Array[String] = Array.empty`
       - `bucketSpec: Option[BucketSpec] = None`
       - `provider: String`
       - `options: Map[String, String]`
      
      This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones).  All reconciliation of partitions, buckets, schema from metastores or inference is done here.
      
      ### DataSourceAnalysis / DataSourceStrategy
      Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
       - pruning the files from partitions that will be read based on filters.
       - appending partition columns*
       - applying additional filters when a data source can not evaluate them internally.
       - constructing an RDD that is bucketed correctly when required*
       - sanity checking schema match-up and other analysis when writing.
      
      *In the future we should do the following:
       - Break out file handling into its own Strategy as it is sufficiently complex / isolated.
       - Push the appending of partition columns down into `FileFormat` to avoid an extra copy / devectorization.
       - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11509 from marmbrus/fileDataSource.
      e720dda4
  9. Mar 04, 2016
  10. Mar 03, 2016
    • [MINOR] Fix typos in comments and testcase name of code · 941b270b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes typos in comments and testcase name of code.
      
      ## How was this patch tested?
      
      manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
      941b270b
    • [MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam · ce58e99a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
      cc mengxr srowen
      ## How was this patch tested?
      Documents change, no test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11344 from yanboliang/shared-cleanup.
      ce58e99a
    • [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule · b5f02d67
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
      This issue aims to remove unused imports from Java/Scala code and to add an `UnusedImports` checkstyle rule to help developers.
      
      ## How was this patch tested?
      ```
      ./dev/lint-java
      ./build/sbt compile
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11438 from dongjoon-hyun/SPARK-13583.
      b5f02d67
    • [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x · e97fc7f1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
      
      - Inner class should be static
      - Mismatched hashCode/equals
      - Overflow in compareTo
      - Unchecked warnings
      - Misuse of assert, vs junit.assert
      - get(a) + getOrElse(b) -> getOrElse(a,b)
      - Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
      - Dead code
      - tailrec
       - exists(_ == ) -> contains
       - find + nonEmpty -> exists
       - filter + size -> count
       - reduce(_+_) -> sum
       - map + flatten -> flatMap
      
      The most controversial may be `.size` -> `.length`, simply because of the sheer size of the change. It is intended to avoid implicit conversions that might be expensive in some places.
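
      A few illustrative before/after pairs for the rewrites above (simplified examples, not lines taken from the patch):

      ```scala
      val xs = Seq(1, 2, 3)

      xs.exists(_ == 2)                // before
      xs.contains(2)                   // after

      xs.filter(_ > 1).size            // before
      xs.count(_ > 1)                  // after

      xs.map(x => Seq(x, x)).flatten   // before
      xs.flatMap(x => Seq(x, x))       // after

      xs.reduce(_ + _)                 // before
      xs.sum                           // after
      ```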
      
      ## How was this patch tested?
      
      Existing Jenkins unit tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11292 from srowen/SPARK-13423.
      e97fc7f1
  11. Mar 01, 2016
  12. Feb 29, 2016
    • [SPARK-13506][MLLIB] Fix the wrong parameter in R code comment in AssociationRulesSuite · ac5c6352
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13506
      
      ## What changes were proposed in this pull request?
      
      Just change the R snippet comment in AssociationRulesSuite.
      
      ## How was this patch tested?
      
      unit test passed
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11387 from zhengruifeng/ars.
      ac5c6352
    • [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default... · d81a7135
      Yanbo Liang authored
      [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python
      
      ## What changes were proposed in this pull request?
      * The default value of ```regParam``` for PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala, which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
      * BTW, if we use a known updater (L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` has no effect if we fall into that route.
      * Make a pass over all parameters of ```LogisticRegressionWithLBFGS```; the others are set properly.
      
      cc mengxr dbtsai
      ## How was this patch tested?
      No new tests, it should pass all current tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11424 from yanboliang/spark-13545.
      d81a7135
  13. Feb 26, 2016
    • [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format · b33261f9
      Bryan Cutler authored
      Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the tree module.
      
      closes #10601
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: vijaykiran <mail@vijaykiran.com>
      
      Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.
      b33261f9
    • [SPARK-13457][SQL] Removes DataFrame RDD operations · 99dfcedb
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This is another try of PR #11323.
      
      This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.
      
      PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323.
      
      ## How was this patch tested?
      
      No extra tests are added. Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11388 from liancheng/remove-df-rdd-ops.
      99dfcedb
  14. Feb 25, 2016
    • [SPARK-13028] [ML] Add MaxAbsScaler to ML.feature as a transformer · 90d07154
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-13028
      MaxAbsScaler works in a very similar way to MinMaxScaler, but scales the data so that the training set lies within the range [-1, 1], by dividing each feature by its maximum absolute value. The motivation for this scaling includes robustness to very small standard deviations of features and preservation of zero entries in sparse data.
      
      Unlike StandardScaler and MinMaxScaler, MaxAbsScaler does not shift/center the data, and thus does not destroy any sparsity.
      
      Something similar from sklearn:
      http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler
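
      A small numeric sketch of the rule (illustrative only, not the ML implementation): each feature is divided by its maximum absolute value, so results fall in [-1, 1] and zero entries stay zero.

      ```scala
      val data = Array(
        Array( 1.0, -8.0, 0.0),
        Array( 2.0,  4.0, 0.0),
        Array(-4.0,  2.0, 5.0))

      val maxAbs = data.transpose.map(col => col.map(math.abs).max)   // per-feature max |x|: (4.0, 8.0, 5.0)
      val scaled = data.map(_.zip(maxAbs).map { case (x, m) => if (m == 0.0) 0.0 else x / m })
      // first row becomes Array(0.25, -1.0, 0.0); sparsity is preserved
      ```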
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #10939 from hhbyyh/maxabs and squashes the following commits:
      
      fd8bdcd [Yuhao Yang] add tag and some optimization on fit
      648fced [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      75bebc2 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      cb10bb6 [Yuhao Yang] remove minmax
      91ef8f3 [Yuhao Yang] ut added
      8ab0747 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      a9215b5 [Yuhao Yang] max abs scaler
      90d07154
    • [SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication · 14e2700d
      Yu ISHIKAWA authored
      ## What changes were proposed in this pull request?
      ML StringIndexer does not protect itself from column name duplication.
      
      We should still improve the way we validate the schema of `StringIndexer` and `StringIndexerModel`.  However, it would be better to address that in a separate issue.
      
      ## How was this patch tested?
      unit test
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #11370 from yu-iskw/SPARK-12874.
      14e2700d
    • Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations" · 751724b1
      Davies Liu authored
      This reverts commit 157fe64f.
      751724b1
    • [SPARK-13457][SQL] Removes DataFrame RDD operations · 157fe64f
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.
      
      ## How was this patch tested?
      
      No extra tests are added. Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11323 from liancheng/remove-df-rdd-ops.
      157fe64f
    • [SPARK-13490][ML] ML LinearRegression should cache standardization param value · 4460113d
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of the ```standardization``` param rather than re-fetching it from the ```ParamMap``` on every OWLQN iteration.
      cc srowen
      
      ## How was this patch tested?
      No extra tests are added. It should pass all existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11367 from yanboliang/spark-13490.
      4460113d
    • [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames · 6f8e835c
      Oliver Pierson authored
      ## What changes were proposed in this pull request?
      
      Change line 113 of QuantileDiscretizer.scala to
      
      `val requiredSamples = math.max(numBins * numBins, 10000.0)`
      
      so that `requiredSamples` is a `Double`.  This will fix the division in line 114, which currently results in zero if `requiredSamples < dataset.count`.
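
      The arithmetic behind the fix, in isolation (illustrative values; not the actual QuantileDiscretizer code):

      ```scala
      val numBins = 10
      val datasetCount = 1000000L

      // Integer arithmetic: max(100, 10000) / 1000000L == 0, so the sampling fraction collapses to 0.
      val brokenFraction = math.max(numBins * numBins, 10000) / datasetCount     // 0

      // With requiredSamples as a Double, the division stays fractional.
      val requiredSamples = math.max(numBins * numBins, 10000.0)
      val fraction = requiredSamples / datasetCount                              // 0.01
      ```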
      
      ## How was this patch tested?
      Manual tests.  I was having problems using QuantileDiscretizer with a dataset, and after making this change QuantileDiscretizer behaves as expected.
      
      Author: Oliver Pierson <ocp@gatech.edu>
      Author: Oliver Pierson <opierson@umd.edu>
      
      Closes #11319 from oliverpierson/SPARK-13444.
      6f8e835c
  15. Feb 23, 2016
  16. Feb 22, 2016
    • [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements -... · 33ef3aa7
      Narine Kokhlikyan authored
      [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements - avoid creating new instances of arrays/vectors for each record
      
      As also noted by a TODO in the AFTAggregator.add(data: AFTPoint) method, a new array is created for the intercept value and concatenated with another array containing the betas; the resulting array is converted into a dense vector, which in turn is converted into a Breeze vector.
      This is expensive and not particularly elegant.
      
      I've tried to solve the above-mentioned problem by a simple algebraic decomposition: keeping and treating the intercept independently.
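
      A sketch of the idea (illustrative only, not the actual AFTAggregator code): keep the intercept separate and add it after the dot product, instead of allocating concatenated arrays per record.

      ```scala
      // Before: allocate new arrays for every record just to compute the margin.
      def marginConcat(beta: Array[Double], intercept: Double, features: Array[Double]): Double = {
        val coefficients = Array(intercept) ++ beta           // new array per record
        val augmented = Array(1.0) ++ features                // another new array per record
        coefficients.zip(augmented).map { case (c, x) => c * x }.sum
      }

      // After: treat the intercept independently; no per-record allocation.
      def marginDecomposed(beta: Array[Double], intercept: Double, features: Array[Double]): Double = {
        var dot = 0.0
        var i = 0
        while (i < beta.length) { dot += beta(i) * features(i); i += 1 }
        dot + intercept
      }
      ```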
      
      Please let me know what you think and if you have any questions.
      
      Thanks,
      Narine
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #11179 from NarineK/survivaloptim.
      33ef3aa7
    • [SPARK-13334][ML] ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should set parent · 40e6d40f
      Yanbo Liang authored
      ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11214 from yanboliang/spark-13334.
      40e6d40f