  1. Apr 12, 2016
    • Shixiong Zhu's avatar
      [SPARK-14579][SQL] Fix a race condition in StreamExecution.processAllAvailable · 768b3d62
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      There is a race condition in `StreamExecution.processAllAvailable`. Here is an execution order to reproduce it.
      
      | Time        |Thread 1           | MicroBatchThread  |
      |:-------------:|:-------------:|:-----:|
      | 1 | |  `dataAvailable in constructNextBatch` returns false  |
      | 2 | addData(newData)      |   |
      | 3 | `noNewData = false` in  processAllAvailable |  |
      | 4 | | noNewData = true |
      | 5 | `noNewData` is true so just return | |
      
      The root cause is that checking `dataAvailable` and changing `noNewData` to true is not atomic. This PR puts these two actions into a `synchronized` block to make sure they are atomic.
      
      In addition, this PR also has the following changes:
      
      - Make `committedOffsets` and `availableOffsets` volatile to make sure they can be seen by other threads.
      - Copy the reference of `availableOffsets` to a local variable so that `sourceStatuses` can use a snapshot of `availableOffsets`.
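The fix described above can be sketched with a toy version of the two flags (this is illustrative only, not Spark's actual `StreamExecution` code):

```scala
// Illustrative sketch: checking dataAvailable and flipping noNewData
// happen inside one synchronized block, so an addData() cannot slip
// in between the two steps as in the interleaving table above.
class BatchState {
  private var dataAvailable = false
  private var noNewData = false

  def addData(): Unit = synchronized {
    dataAvailable = true
    noNewData = false // new data invalidates an earlier "no new data" verdict
  }

  // Before the fix, these two steps ran without a common lock.
  def markIdleIfNoData(): Unit = synchronized {
    if (!dataAvailable) noNewData = true
  }

  def isIdle: Boolean = synchronized { noNewData }
}
```

With the lock held across both steps, a writer can no longer produce a spurious `noNewData = true` between the check and the update.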
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12339 from zsxwing/race-condition.
      768b3d62
    • Davies Liu's avatar
      [SPARK-14578] [SQL] Fix codegen for CreateExternalRow with nested wide schema · 372baf04
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      With a wide schema, the expressions for the fields are split into multiple functions, but the variable `loopVar` can't be accessed in the split functions. This PR changes it into a class member.
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12338 from davies/nested_row.
      372baf04
    • Sital Kedia's avatar
      [SPARK-14363] Fix executor OOM due to memory leak in the Sorter · d187e7de
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      Fix a memory leak in the Sorter. When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOMs and also memory underutilization.
      This is a regression partially introduced in PR https://github.com/apache/spark/pull/9241
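A minimal sketch of the leak and the fix (a hypothetical class, not the actual UnsafeExternalSorter): the spill path must also drop the pointer array, not only the record pages.

```scala
// Hypothetical in-memory sorter: before the fix, spill() wrote records
// to disk but kept the large pointer array alive; the fix releases it.
class InMemorySorter {
  private var pointerArray: Array[Long] = new Array[Long](1 << 20)

  def memoryUsedBytes: Long =
    if (pointerArray == null) 0L else pointerArray.length.toLong * 8

  def spill(): Unit = {
    // ... write sorted records to disk (omitted) ...
    pointerArray = null // the fix: free the array so the memory is reclaimable
  }

  def reset(): Unit = pointerArray = new Array[Long](1 << 20)
}
```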
      
      ## How was this patch tested?
      
      Tested by running a job and observed around 30% speedup after this change.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #12285 from sitalkedia/executor_oom.
      d187e7de
    • Reynold Xin's avatar
      [SPARK-14547] Avoid DNS resolution for reusing connections · c439d88e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes the connection creation logic in the network client module to avoid DNS resolution when reusing connections.
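The idea can be sketched as keying the connection pool on an unresolved address (illustrative only; the real client pool lives in Spark's network module):

```scala
import java.net.InetSocketAddress
import scala.collection.mutable

// Sketch: InetSocketAddress.createUnresolved never touches DNS, so a
// pool lookup for an existing connection is resolution-free; only a
// pool miss (a genuinely new connection) would need to resolve.
val pool = mutable.Map.empty[InetSocketAddress, String]

def getConnection(host: String, port: Int): String = {
  val key = InetSocketAddress.createUnresolved(host, port)
  pool.getOrElseUpdate(key, {
    // Only here would we resolve and open a socket, e.g.:
    // val resolved = new InetSocketAddress(host, port) // performs DNS lookup
    s"connection-to-$host:$port"
  })
}
```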
      
      ## How was this patch tested?
      Testing in production. This is too difficult to test in isolation (for high fidelity unit tests, we'd need to change the DNS resolution behavior in the JVM).
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12315 from rxin/SPARK-14547.
      c439d88e
    • Davies Liu's avatar
      [SPARK-14544] [SQL] improve performance of SQL UI tab · 1ef5f8cf
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR improves the performance of the SQL UI by:

      1) removing the details column from the all-executions page (the first page in the SQL tab). We can check the details by entering the execution page.
      2) switching from break-all to break-word, since break-all has become super slow in Chrome recently.
      3) using "display: none" to hide a block.
      4) using one JS closure for all the executions, not one for each.
      5) removing the height limitation of details, so we don't need to scroll it in a tiny window.
      
      ## How was this patch tested?
      
      Existing tests.
      
      ![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12311 from davies/ui_perf.
      1ef5f8cf
    • Terence Yim's avatar
      [SPARK-14513][CORE] Fix threads left behind after stopping SparkContext · 3e53de4b
      Terence Yim authored
      ## What changes were proposed in this pull request?
      
      Shut down the `QueuedThreadPool` used by the Jetty `Server` to avoid thread leakage after the SparkContext is stopped.
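The shape of the fix can be illustrated with a JDK executor (Jetty's `QueuedThreadPool` has its own start/stop lifecycle, so this is only an analogy): stopping the server must also stop its worker pool, otherwise the pool's threads outlive the context.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Analogy using a JDK pool: without an explicit shutdown, the worker
// threads keep running after the owning component is "stopped".
val pool = Executors.newFixedThreadPool(2)
pool.submit(new Runnable { def run(): Unit = () })

// On stop: shut the pool down and wait for its threads to terminate.
pool.shutdown()
val stopped = pool.awaitTermination(5, TimeUnit.SECONDS)
```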
      
      Note: If this fix is going to be applied to `branch-1.6`, one more patch on the `NettyRpcEnv` class is needed so that `NettyRpcEnv._fileServer.shutdown` is called in the `NettyRpcEnv.cleanup` method. This is due to the removal of the `_fileServer` field from the `NettyRpcEnv` class in the master branch. Please advise if a second PR is necessary to bring this fix back to `branch-1.6`.
      
      ## How was this patch tested?
      
      Ran the ./dev/run-tests locally
      
      Author: Terence Yim <terence@cask.co>
      
      Closes #12318 from chtyim/fixes/SPARK-14513-thread-leak.
      3e53de4b
    • bomeng's avatar
      [SPARK-14414][SQL] improve the error message class hierarchy · bcd20762
      bomeng authored
      ## What changes were proposed in this pull request?
      
      Previously we used `AnalysisException`, `ParseException`, `NoSuchFunctionException`, etc. when a parsing error was encountered. I am trying to make this consistent, with **minimum** impact on the current implementation, by changing the class hierarchy.
      1. `NoSuchItemException` is removed, since it is an abstract class that just takes a message string.
      2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extend `AnalysisException`, as does `ParseException`; they are all under the `AnalysisException` umbrella, but you can still handle them in a granular way.
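A miniature of the described hierarchy (simplified; the real `AnalysisException` carries more context, such as line and position):

```scala
// Simplified model: the specific "no such X" exceptions all extend the
// AnalysisException umbrella, so callers can catch either the umbrella
// type or a specific subclass.
class AnalysisException(message: String) extends Exception(message)

class NoSuchTableException(table: String)
  extends AnalysisException(s"Table not found: $table")

class NoSuchFunctionException(fn: String)
  extends AnalysisException(s"Function not found: $fn")
```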
      
      ## How was this patch tested?
      The existing test cases should cover this patch.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #12314 from bomeng/SPARK-14414.
      bcd20762
    • Davies Liu's avatar
      [SPARK-14562] [SQL] improve constraints propagation in Union · 85e68b4b
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, Union only takes the intersection of the constraints from its children; all others are dropped. We should try to merge them together.
      
      This PR tries to merge constraints that have the same reference but come from different children; for example, `a > 10` and `a < 100` could be merged as `a > 10 || a < 100`.
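The merge can be sketched over a toy constraint type (illustrative; Catalyst's real constraints are `Expression`s):

```scala
// Toy constraints: a predicate attached to an attribute name. Constraints
// on the same attribute from different Union children are combined with
// OR instead of being dropped.
case class Constraint(attr: String, pred: Int => Boolean)

def mergeByAttr(left: Seq[Constraint], right: Seq[Constraint]): Seq[Constraint] =
  for {
    l <- left
    r <- right
    if l.attr == r.attr
  } yield Constraint(l.attr, x => l.pred(x) || r.pred(x))

// `a > 10` from one child, `a < 100` from the other:
val merged = mergeByAttr(Seq(Constraint("a", _ > 10)), Seq(Constraint("a", _ < 100)))
```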
      
      ## How was this patch tested?
      
      Added more cases in existing test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12328 from davies/union_const.
      85e68b4b
    • Liwei Lin's avatar
      [SPARK-14556][SQL] Code clean-ups for package o.a.s.sql.execution.streaming.state · 852bbc6c
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      - `StateStoreConf.**max**DeltasForSnapshot` was renamed to `StateStoreConf.**min**DeltasForSnapshot`
      - some state switch checks were added
      - improved consistency between method names and string literals
      - other comments & typo fix
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12323 from lw-lin/streaming-state-clean-up.
      852bbc6c
    • Yanbo Liang's avatar
      [SPARK-14147][ML][SPARKR] SparkR predict should not output feature column · 111a6247
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR does not support the vector type, which is the default type of the feature column in ML. R's predict also does not output an intermediate feature column. So SparkR ```predict``` should not output the feature column. In this PR, I only fix this issue for ```naiveBayes``` and ```survreg```. ```kmeans``` already has the right code path, and ```glm``` will be fixed in the SparkRWrapper refactor (#12294).
      
      ## How was this patch tested?
      No new tests.
      
      cc mengxr shivaram
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11958 from yanboliang/spark-14147.
      111a6247
    • Xiangrui Meng's avatar
      [SPARK-14563][ML] use a random table name instead of __THIS__ in SQLTransformer · 1995c2e6
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Use a random table name instead of `__THIS__` in SQLTransformer, and add a test for `transformSchema`. The problems of using `__THIS__` are:
      
      * It doesn't work under HiveContext (in Spark 1.6)
      * Race conditions
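The core of the change can be sketched as follows (the helper name is assumed for illustration):

```scala
import java.util.UUID

// Sketch: derive a unique temp-table name per transform() call instead of
// the fixed placeholder __THIS__, so concurrent transforms cannot collide
// on one shared registered name.
def randomTableName(): String =
  "sql_transformer_" + UUID.randomUUID().toString.replace("-", "")

val statement = "SELECT *, v + 1 AS v1 FROM __THIS__"
val resolved = statement.replace("__THIS__", randomTableName())
```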
      
      ## How was this patch tested?
      
      * Manual test with HiveContext.
      * Added a unit test for `transformSchema` to improve coverage.
      
      cc: yhuai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12330 from mengxr/SPARK-14563.
      1995c2e6
    • Kai Jiang's avatar
      [SPARK-13597][PYSPARK][ML] Python API for GeneralizedLinearRegression · 7f024c47
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      
      Python API for GeneralizedLinearRegression
      JIRA: https://issues.apache.org/jira/browse/SPARK-13597
      
      ## How was this patch tested?
      
      The patch is tested with Python doctest.
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #11468 from vectorijk/spark-13597.
      7f024c47
    • Yanbo Liang's avatar
      [SPARK-13322][ML] AFTSurvivalRegression supports feature standardization · 101663f1
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      AFTSurvivalRegression should support feature standardization; it will improve the convergence rate.
      I tested the convergence rate on the [Ovarian](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/ovarian.html) data, a standard dataset that comes with the survival library in R:
      * without standardization (before this PR) -> 74 iterations.
      * with standardization (after this PR) -> 38 iterations.
      
      After this fix, it will converge to the same solution with or without ```standardization```. This means ```standardization = false``` runs the same code path as ```standardization = true```, because if the features are not standardized at all, convergence issues arise when the features have very different scales. This behavior is the same as in ML [```LinearRegression``` and ```LogisticRegression```](https://issues.apache.org/jira/browse/SPARK-8522). See more discussion about this topic at #11247.
      cc mengxr
      ## How was this patch tested?
      unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11365 from yanboliang/spark-13322.
      101663f1
    • Yanbo Liang's avatar
      [SPARK-12566][SPARK-14324][ML] GLM model family, link function support in SparkR:::glm · 75e05a5a
      Yanbo Liang authored
      * SparkR glm supports families and link functions which match R's signature for family.
      * SparkR glm API refactor. The comparative standard of the new API is R's glm, so I only expose the arguments that R's glm supports: ```formula, family, data, epsilon and maxit```.
      * This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this gets in.
      * This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.
      
      Unit tests.
      
      cc mengxr jkbradley hhbyyh
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12294 from yanboliang/spark-12566.
      75e05a5a
    • Shixiong Zhu's avatar
      [SPARK-14474][SQL] Move FileSource offset log into checkpointLocation · 6bf69214
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Now that we have a single location for storing checkpointed state, this PR propagates the checkpoint location into FileStreamSource so that we don't have one random log off on its own.
      
      ## How was this patch tested?
      
      test("metadataPath should be in checkpointLocation")
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12247 from zsxwing/file-source-log-location.
      6bf69214
    • Yong Tang's avatar
      [SPARK-3724][ML] RandomForest: More options for feature subset size. · da60b34d
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
      This PR tries to support more options for the feature subset size in the RandomForest implementation. Previously, RandomForest only supported "auto", "all", "sqrt", "log2" and "onethird". This PR supports any given value to allow model search.
      
      In this PR, `featureSubsetStrategy` could be passed with:
      a) a real number in the range `(0.0, 1.0]` that represents the fraction of the number of features in each subset, or
      b) an integer (`> 0`) that represents the number of features in each subset.
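A sketch of how the extended values could be interpreted (the function name and exact rounding are assumptions for illustration):

```scala
// Maps a featureSubsetStrategy string to a feature count per node:
// named strategies as before, plus a fraction in (0.0, 1.0] or an
// integer > 0 as introduced by this PR.
def numFeaturesPerNode(strategy: String, totalFeatures: Int): Int =
  strategy match {
    case "all"      => totalFeatures
    case "sqrt"     => math.sqrt(totalFeatures).ceil.toInt
    case "log2"     => math.max(1, (math.log(totalFeatures) / math.log(2)).ceil.toInt)
    case "onethird" => math.max(1, (totalFeatures / 3.0).ceil.toInt)
    case s if s.matches("""\d+""") => s.toInt
    case s if s.matches("""0?\.\d+|1\.0""") =>
      math.max(1, (s.toDouble * totalFeatures).ceil.toInt)
    case other => throw new IllegalArgumentException(s"Bad strategy: $other")
  }
```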
      
      ## How was this patch tested?
      
      Two tests `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite` have been updated to check the additional options for params in this PR.
      An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #11989 from yongtang/SPARK-3724.
      da60b34d
    • Cheng Lian's avatar
      [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPORARY TABLE ... USING ... AS... · 124cbfb6
      Cheng Lian authored
      [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPORARY TABLE ... USING ... AS SELECT" shouldn't create persisted table
      
      ## What changes were proposed in this pull request?
      
      When planning logical plan node `CreateTableUsingAsSelect`, we neglected its `temporary` field and always generates a `CreateMetastoreDataSourceAsSelect`. This PR fixes this issue generating `CreateTempTableUsingAsSelect` when `temporary` is true.
      
      This PR also fixes SPARK-14493, since its root cause is that `CreateMetastoreDataSourceAsSelect` uses the default Hive warehouse location when the `PATH` data source option is absent.
      
      ## How was this patch tested?
      
      Added a test case to create a temporary table using the target syntax and check whether it's indeed a temporary table.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12303 from liancheng/spark-14488-fix-ctas-using.
      124cbfb6
    • Dongjoon Hyun's avatar
      [SPARK-14508][BUILD] Add a new ScalaStyle Rule `OmitBracesInCase` · b0f5497e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
        ```
        case: Always omit braces in case clauses.
        ```
      This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including Scala style checking)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12280 from dongjoon-hyun/SPARK-14508.
      b0f5497e
    • Wenchen Fan's avatar
      [SPARK-14535][SQL] Remove buildInternalScan from FileFormat · 678b96e7
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Now that `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12300 from cloud-fan/remove.
      678b96e7
    • Wenchen Fan's avatar
      [SPARK-14554][SQL] disable whole stage codegen if there are too many input columns · 52a80112
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/12047/files#diff-94a1f59bcc9b6758c4ca874652437634R529, we may split the field expression code in `CreateExternalRow` to support wide tables. However, the whole-stage codegen framework doesn't support this, because the input for expressions is not always the input row but can be `CodeGenContext.currentVars`, which doesn't work well with `CodeGenContext.splitExpressions`.

      Actually we do have a check to guard against these cases, but it's incomplete: it only checks output fields.

      This PR improves the whole-stage codegen support check to disable it if there are too many input fields, so that we can avoid splitting the field expression code in `CreateExternalRow` for whole-stage codegen.
      
      TODO: Is it a better solution if we can make `CodeGenContext.currentVars` work well with `CodeGenContext.splitExpressions`?
      
      ## How was this patch tested?
      
      new test in DatasetSuite.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12322 from cloud-fan/codegen.
      52a80112
    • gatorsmile's avatar
      [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table · 2d81ba54
      gatorsmile authored
      #### What changes were proposed in this pull request?
      In this PR, we are trying to address the comment in the original PR: https://github.com/apache/spark/commit/dfce9665c4b2b29a19e6302216dae2800da68ff9#commitcomment-17057030
      
      In this PR, we check whether the table/view exists at the beginning, and then do not need to capture the exceptions, including `NoSuchTableException` and `InvalidTableException`. We still capture NonFatal exceptions when doing `sqlContext.cacheManager.tryUncacheQuery`.
      
      #### How was this patch tested?
      The existing test cases should cover the code changes of this PR.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12321 from gatorsmile/dropViewFollowup.
      2d81ba54
  2. Apr 11, 2016
    • Andrew Or's avatar
      [SPARK-14132][SPARK-14133][SQL] Alter table partition DDLs · 83fb9640
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This implements a few alter table partition commands using the `SessionCatalog`. In particular:
      ```
      ALTER TABLE ... ADD PARTITION ...
      ALTER TABLE ... DROP PARTITION ...
      ALTER TABLE ... RENAME PARTITION ... TO ...
      ```
      The following operations are not supported, and an `AnalysisException` with a helpful error message will be thrown if the user tries to use them:
      ```
      ALTER TABLE ... EXCHANGE PARTITION ...
      ALTER TABLE ... ARCHIVE PARTITION ...
      ALTER TABLE ... UNARCHIVE PARTITION ...
      ALTER TABLE ... TOUCH ...
      ALTER TABLE ... COMPACT ...
      ALTER TABLE ... CONCATENATE
      MSCK REPAIR TABLE ...
      ```
      
      ## How was this patch tested?
      
      `DDLSuite`, `DDLCommandSuite` and `HiveDDLCommandSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12220 from andrewor14/alter-partition-ddl.
      83fb9640
    • Joseph K. Bradley's avatar
      [MINOR][ML] Fixed MLlib build warnings · e9e1adc0
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Fixes to eliminate warnings during package and doc builds.
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12263 from jkbradley/warning-cleanups.
      e9e1adc0
    • Liang-Chi Hsieh's avatar
      [SPARK-14520][SQL] Use correct return type in VectorizedParquetInputFormat · 26d7af91
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      JIRA: https://issues.apache.org/jira/browse/SPARK-14520
      
      `VectorizedParquetInputFormat` inherits `ParquetInputFormat` and overrides `createRecordReader`. However, its overridden `createRecordReader` returns a `ParquetRecordReader`. It should return a `RecordReader`. Otherwise, `ClassCastException` will be thrown.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12292 from viirya/fix-vectorized-input-format.
      26d7af91
    • Eric Liang's avatar
      [SPARK-14475] Propagate user-defined context from driver to executors · 6f27027d
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This adds a new API call `TaskContext.getLocalProperty` for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.
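The propagation mechanism can be sketched with plain `java.util.Properties` (the class names here are stand-ins, not Spark's real types):

```scala
import java.util.Properties

// Sketch: the driver snapshots its local properties into each task
// description when the task is shipped, and the executor exposes that
// snapshot through the task's context.
class DriverSide {
  val localProperties = new Properties()
  def setLocalProperty(k: String, v: String): Unit =
    localProperties.setProperty(k, v)

  // Snapshot taken when the task is serialized and shipped:
  def taskProperties: Properties = {
    val p = new Properties()
    localProperties.stringPropertyNames().forEach { k =>
      p.setProperty(k, localProperties.getProperty(k))
    }
    p
  }
}

class TaskContextSketch(props: Properties) {
  def getLocalProperty(key: String): String = props.getProperty(key)
}
```

Because the snapshot is taken at ship time, later changes on the driver do not affect tasks already in flight.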
      
      ## How was this patch tested?
      
      Unit tests.
      
      cc JoshRosen
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #12248 from ericl/sc-2813.
      6f27027d
    • Luciano Resende's avatar
      [SPARK-10521][SQL] Utilize Docker for test DB2 JDBC Dialect support · 94de6305
      Luciano Resende authored
      Add integration tests based on docker to test DB2 JDBC dialect support
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #9893 from lresende/SPARK-10521.
      94de6305
    • Yanbo Liang's avatar
      [SPARK-14298][ML][MLLIB] Add unit test for EM LDA disable checkpointing · 3f0f4080
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      This is follow up for #12089, add unit test for EM LDA which test disable checkpointing when set ```checkpointInterval = -1```.
      ## How was this patch tested?
      unit test.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12286 from yanboliang/spark-14298-followup.
      3f0f4080
    • Oliver Pierson's avatar
      [SPARK-13600][MLLIB] Use approxQuantile from DataFrame stats in QuantileDiscretizer · 89a41c5b
      Oliver Pierson authored
      ## What changes were proposed in this pull request?
      QuantileDiscretizer can return an unexpected number of buckets in certain cases. This PR proposes to fix this issue and also refactor QuantileDiscretizer to use approxQuantile from the DataFrame stats functions.
      ## How was this patch tested?
      QuantileDiscretizerSuite unit tests (some existing tests will change or even be removed in this PR)
      
      Author: Oliver Pierson <ocp@gatech.edu>
      
      Closes #11553 from oliverpierson/SPARK-13600.
      89a41c5b
    • Shixiong Zhu's avatar
      [SPARK-14494][SQL] Fix the race conditions in MemoryStream and MemorySink · 2dacc81e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Make sure accesses to mutable variables in MemoryStream and MemorySink are protected by `synchronized`.
      This is probably why MemorySinkSuite failed here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/650/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12261 from zsxwing/memory-race-condition.
      2dacc81e
    • Dongjoon Hyun's avatar
      [SPARK-14502] [SQL] Add optimization for Binary Comparison Simplification · 5de26194
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      We can simplify binary comparisons with semantically-equal operands:
      
      1. Replace '<=>' with 'true' literal.
      2. Replace '=', '<=', and '>=' with 'true' literal if both operands are non-nullable.
      3. Replace '<' and '>' with 'false' literal if both operands are non-nullable.
      
      For example, the following example plan
      ```
      scala> sql("SELECT * FROM (SELECT explode(array(1,2,3)) a) T WHERE a BETWEEN a AND a+7").explain()
      ...
      :  +- Filter ((a#59 >= a#59) && (a#59 <= (a#59 + 7)))
      ...
      ```
      will be optimized into the following.
      ```
      :  +- Filter (a#47 <= (a#47 + 7))
      ```
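The three rules can be sketched as a rewrite over a toy expression tree (illustrative; the real rule operates on Catalyst `Expression`s):

```scala
// Toy rewrite: comparisons whose two sides are semantically equal
// collapse to literals; '=', '<=', '>=' and '<', '>' additionally
// require the operand to be non-nullable, while '<=>' does not.
sealed trait Expr
case class Attr(name: String, nullable: Boolean) extends Expr
case class Cmp(op: String, l: Expr, r: Expr) extends Expr
case class Lit(b: Boolean) extends Expr

def simplify(e: Expr): Expr = e match {
  case Cmp("<=>", l, r) if l == r => Lit(true)
  case Cmp(op, l @ Attr(_, false), r) if l == r && Set("=", "<=", ">=")(op) => Lit(true)
  case Cmp(op, l @ Attr(_, false), r) if l == r && Set("<", ">")(op)       => Lit(false)
  case other => other
}
```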
      
      ## How was this patch tested?
      
      Pass the Jenkins tests including new `BinaryComparisonSimplificationSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12267 from dongjoon-hyun/SPARK-14502.
      5de26194
    • Davies Liu's avatar
      [SPARK-14528] [SQL] Fix same result of Union · 652c4703
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR fixes `sameResult()` for Union.
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12295 from davies/fix_sameResult.
      652c4703
    • DB Tsai's avatar
      [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom · efaf7d18
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies.
      
      The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.
      
      Thanks.
      
      ## How was this patch tested?
      
      Unit tests
      
      mengxr tedyu holdenk
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
      efaf7d18
    • Zheng RuiFeng's avatar
      [SPARK-14510][MLLIB] Add args-checking for LDA and StreamingKMeans · 643b4e22
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add argument checking for LDA and StreamingKMeans.
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12062 from zhengruifeng/initmodel.
      643b4e22
    • Xiangrui Meng's avatar
      [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs · 1c751fcf
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.
      
      Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.
      
      TODOs:
      - [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
      - [x] Python
      - [x] add a new test to accept Dataset[LabeledPoint]
      - [x] remove unused imports of Dataset
      
      ## How was this patch tested?
      
      Existing unit tests with some modifications.
      
      cc: rxin jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12274 from mengxr/SPARK-14500.
      1c751fcf
    • Rekha Joshi's avatar
      [SPARK-14372][SQL] Dataset.randomSplit() needs a Java version · e82d95bf
      Rekha Joshi authored
      ## What changes were proposed in this pull request?
      
      1. Added method randomSplitAsList() in Dataset for Java, for https://issues.apache.org/jira/browse/SPARK-14372
      
      ## How was this patch tested?
      
      TestSuite
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: Joshi <rekhajoshm@gmail.com>
      
      Closes #12184 from rekhajoshm/SPARK-14372.
      e82d95bf
    • Dongjoon Hyun's avatar
      [MINOR][DOCS] Fix wrong data types in JSON Datasets example. · 1a0cca1f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the `age` data types from `integer` to `long` in `SQL Programming Guide: JSON Datasets`.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12290 from dongjoon-hyun/minor_fix_type_in_json_example.
      1a0cca1f
  3. Apr 10, 2016
    • gatorsmile's avatar
      [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table · 9f838bd2
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we can still catch user errors when users try to drop a table using `DROP VIEW`.
      
      #### How was this patch tested?
      Modified the existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12284 from gatorsmile/followupDropTable.
      9f838bd2
    • Davies Liu's avatar
      [SPARK-14419] [MINOR] coding style cleanup · fbf8d008
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Making them more consistent.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12289 from davies/cleanup_style.
      fbf8d008
    • Dongjoon Hyun's avatar
      [SPARK-14415][SQL] All functions should show usages by command `DESC FUNCTION` · a7ce473b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, many functions do not show usages, like the following.
      ```
      scala> sql("desc function extended `sin`").collect().foreach(println)
      [Function: sin]
      [Class: org.apache.spark.sql.catalyst.expressions.Sin]
      [Usage: To be added.]
      [Extended Usage:
      To be added.]
      ```
      
      This PR adds descriptions for functions and adds a test case to prevent adding functions without usage.
      ```
      scala>  sql("desc function extended `sin`").collect().foreach(println);
      [Function: sin]
      [Class: org.apache.spark.sql.catalyst.expressions.Sin]
      [Usage: sin(x) - Returns the sine of x.]
      [Extended Usage:
      > SELECT sin(0);
       0.0]
      ```
      
      The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12185 from dongjoon-hyun/SPARK-14415.
      a7ce473b
    • Örjan Lundberg's avatar
      Update KMeansExample.scala · b5c78562
      Örjan Lundberg authored
      ## What changes were proposed in this pull request?
      The example does not work without the DataFrame import.
      
      ## How was this patch tested?
      
      Example doc only.
      
      Author: Örjan Lundberg <orjan.lundberg@gmail.com>
      
      Closes #12277 from oluies/patch-1.
      b5c78562