  1. Mar 29, 2016
    • Carson Wang's avatar
      [SPARK-14232][WEBUI] Fix event timeline display issue when an executor is... · 15c0b000
      Carson Wang authored
      [SPARK-14232][WEBUI] Fix event timeline display issue when an executor is removed with a multiple line reason.
      
      ## What changes were proposed in this pull request?
      The event timeline doesn't show on the job page if an executor is removed with a multi-line reason. This PR replaces all newline characters in the reason string with spaces.
      
      ![timelineerror](https://cloud.githubusercontent.com/assets/9278199/14100211/5fd4cd30-f5be-11e5-9cea-f32651a4cd62.jpg)
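
      As a rough illustration of the fix described above (a minimal sketch, not the actual Web UI code), the removal reason can be sanitized before it is embedded in the timeline's generated JavaScript:

      ```scala
      // Minimal sketch, assuming the removal reason arrives as a plain String:
      // replacing newline characters with spaces keeps the string on one line
      // when it is embedded in the timeline's JavaScript.
      val reason = "Container killed by YARN for exceeding memory limits.\n" +
        "Consider boosting spark.yarn.executor.memoryOverhead."
      val sanitized = reason.replace("\n", " ")
      ```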
      
      ## How was this patch tested?
      Verified on the Web UI.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #12029 from carsonwang/eventTimeline.
      15c0b000
    • Yuhao Yang's avatar
      [SPARK-14154][MLLIB] Simplify the implementation for Kolmogorov–Smirnov test · d2a819a6
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-14154
      
      I just read the code for KolmogorovSmirnovTest and found it could be much simplified by following the original definition.
      
      Sending a PR for discussion.
      
      ## How was this patch tested?
      unit test
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #11954 from hhbyyh/ksoptimize.
      d2a819a6
    • Cheng Lian's avatar
      [SPARK-14208][SQL] Renames spark.sql.parquet.fileScan · a632bb56
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Renames the SQL option `spark.sql.parquet.fileScan`, since all `HadoopFsRelation`-based data sources are now being migrated to the `FileScanRDD` code path.
      
      ## How was this patch tested?
      
      None.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12003 from liancheng/spark-14208-option-renaming.
      a632bb56
    • Bryan Cutler's avatar
      [SPARK-13963][ML] Adding binary toggle param to HashingTF · 425bcf6d
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Adds a binary toggle parameter to ml.feature.HashingTF, as well as to mllib.feature.HashingTF, since the former wraps the latter's functionality. If true, this parameter sets non-zero term counts to 1, transforming term-count features into binary values that are well suited for discrete probability models.
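
      A rough usage sketch of the toggle described above (the setter name and chaining are assumptions based on this PR's description, not a confirmed API):

      ```scala
      import org.apache.spark.mllib.feature.HashingTF

      // With binary = true, the repeated term "a" contributes 1 instead of 2,
      // which suits discrete probability models such as Bernoulli naive Bayes.
      val tf = new HashingTF(numFeatures = 1 << 10).setBinary(true)
      val features = tf.transform(Seq("a", "a", "b"))
      ```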
      
      ## How was this patch tested?
      Added unit tests for ML and MLlib
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11832 from BryanCutler/binary-param-HashingTF-SPARK-13963.
      425bcf6d
    • Wenchen Fan's avatar
      [SPARK-14158][SQL] implement buildReader for json data source · 83775bc7
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR implements buildReader for the JSON data source and enables it in the new data source code path.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11960 from cloud-fan/json.
      83775bc7
    • wm624@hotmail.com's avatar
      [SPARK-14071][PYSPARK][ML] Change MLWritable.write to be a property · 63b200e8
      wm624@hotmail.com authored
      Make MLWritable.write a property, so we can use .write instead of .write().
      
      Add a new test to ml/test.py to check whether write is a property.
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (11s)
      Finished test(python2.7): pyspark.ml.clustering (16s)
      Finished test(python2.7): pyspark.ml.classification (24s)
      Finished test(python2.7): pyspark.ml.recommendation (24s)
      Finished test(python2.7): pyspark.ml.feature (39s)
      Finished test(python2.7): pyspark.ml.regression (26s)
      Finished test(python2.7): pyspark.ml.tuning (15s)
      Finished test(python2.7): pyspark.ml.tests (30s)
      Tests passed in 55 seconds
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #11945 from wangmiao1981/fix_property.
      63b200e8
    • sethah's avatar
      [SPARK-11730][ML] Add feature importances for GBTs. · f6066b0c
      sethah authored
      ## What changes were proposed in this pull request?
      
      Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a `featureImportances` attribute to `GBTClassifier` and `GBTRegressor` and adds tests for each.
      
      GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J Friedman in [this paper](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).
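
      A minimal standalone sketch of that averaging (illustrative only, using plain arrays rather than the actual ML classes):

      ```scala
      // Each element of `perTree` is one tree's (already normalized) importance vector.
      // The ensemble importance is the element-wise average, re-normalized to sum to 1.
      def averageImportances(perTree: Seq[Array[Double]]): Array[Double] = {
        require(perTree.nonEmpty, "ensemble must contain at least one tree")
        val numFeatures = perTree.head.length
        val summed = new Array[Double](numFeatures)
        for (imp <- perTree; i <- 0 until numFeatures) summed(i) += imp(i)
        val total = summed.sum
        if (total > 0) summed.map(_ / total) else summed
      }
      ```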
      
      ## How was this patch tested?
      
      Unit tests were added to `GBTClassifierSuite` and `GBTRegressorSuite` to validate feature importances.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11961 from sethah/SPARK-11730.
      f6066b0c
  2. Mar 28, 2016
    • Sun Rui's avatar
      [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF. · d3638d7b
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
      
      Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API from SparkR later.
      
      ## How was this patch tested?
      dev/lint-r
      SparkR unit tests
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #12024 from sun-rui/SPARK-12792_new.
      d3638d7b
    • Nong Li's avatar
      [SPARK-14210] [SQL] Add a metric for time spent in scans. · a180286b
      Nong Li authored
      ## What changes were proposed in this pull request?
      
      This adds a metric to parquet scans that measures the time in just the scan phase. This is
      only possible when the scan returns ColumnarBatches, otherwise the overhead is too high.
      
      This combined with the pipeline metric lets us easily see what percent of the time was
      in the scan.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #12007 from nongli/spark-14210.
      a180286b
    • Nong Li's avatar
      [SPARK-13981][SQL] Defer evaluating variables within Filter operator. · 4a55c336
      Nong Li authored
      ## What changes were proposed in this pull request?
      
      This improves the Filter codegen for NULLs by deferring loading the values for IsNotNull.
      Instead of generating code like:
      
      boolean isNull = ...
      int value = ...
      if (isNull) continue;
      
      we will generate:
      boolean isNull = ...
      if (isNull) continue;
      int value = ...
      
      This is useful since retrieving the values can be non-trivial (they can be dictionary encoded
      among other things). This currently only works when the attribute comes from the column batch
      but could be extended to other cases in the future.
      
      ## How was this patch tested?
      
      On tpcds q55, this fixes the regression from introducing the IsNotNull predicates.
      
      ```
      TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
      --------------------------------------------------------------------------------
      q55                                      4564 / 5036         25.2          39.6
      q55                                      4064 / 4340         28.3          35.3
      ```
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11792 from nongli/spark-13981.
      4a55c336
    • Herman van Hovell's avatar
      [SPARK-14213][SQL] Migrate HiveQl parsing to ANTLR4 parser · 27d4ef0c
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      
      This PR migrates all HiveQl parsing to the new ANTLR4 parser. This PR is built on top of https://github.com/apache/spark/pull/12011, and we should wait with merging until that one is in (hence the WIP tag).
      
      As soon as this PR is merged we can start removing much of the old parser infrastructure.
      
      ### How was this patch tested?
      
      Existing Hive unit tests.
      
      cc rxin andrewor14 yhuai
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12015 from hvanhovell/SPARK-14213.
      27d4ef0c
    • Wenchen Fan's avatar
      [SPARK-14205][SQL] remove trait Queryable · 38326cad
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      After DataFrame and Dataset are merged, the trait `Queryable` becomes unnecessary as it has only one implementation. We should remove it.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12001 from cloud-fan/df-ds.
      38326cad
    • Dongjoon Hyun's avatar
      [SPARK-14219][GRAPHX] Fix `pickRandomVertex` not to fall into infinite loops... · 289257c4
      Dongjoon Hyun authored
      [SPARK-14219][GRAPHX] Fix `pickRandomVertex` not to fall into infinite loops for graphs with one vertex
      
      ## What changes were proposed in this pull request?
      
      Currently, `GraphOps.pickRandomVertex()` falls into an infinite loop for graphs that have only one vertex. This PR fixes it by modifying the following termination-checking condition (an illustrative sketch follows the diff).
      ```scala
      -      if (selectedVertices.count > 1) {
      +      if (selectedVertices.count > 0) {
      ```
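
      An illustrative, self-contained sketch of why the old condition never terminates for a single-vertex graph (not the GraphX source; names here are hypothetical):

      ```scala
      import scala.util.Random

      // Repeatedly take a random sample of the vertices until at least one is selected,
      // then return one of them. With the old `selected.size > 1` check, a one-vertex
      // graph can never satisfy the condition, so the loop never ends.
      def pickRandom[V](vertices: Seq[V], probability: Double = 0.5,
          rng: Random = new Random()): V = {
        require(vertices.nonEmpty)
        var selected = Seq.empty[V]
        while (selected.isEmpty) {                 // fixed condition: count > 0
          selected = vertices.filter(_ => rng.nextDouble() < probability)
        }
        selected(rng.nextInt(selected.size))
      }
      ```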
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new test case).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12018 from dongjoon-hyun/SPARK-14219.
      289257c4
    • jerryshao's avatar
      [SPARK-13447][YARN][CORE] Clean the stale states for AM failure and restart situation · 2bc7c96d
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up fix to #9963. In #9963 we handled this stale-state clean-up work only for the dynamic-allocation-enabled scenario. Here we should also clean the states in `CoarseGrainedSchedulerBackend` for the dynamic-allocation-disabled scenario.
      
      Please review, CC andrewor14 lianhuiwang , thanks a lot.
      
      ## How was this patch tested?
      
      Run the unit test locally, also with integration test manually.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11366 from jerryshao/SPARK-13447.
      2bc7c96d
    • jeanlyn's avatar
      [SPARK-13845][CORE] Using onBlockUpdated to replace onTaskEnd avoiding driver OOM · ad9e3d50
      jeanlyn authored
      ## What changes were proposed in this pull request?
      
      We have a streaming job using `FlumePollInputStream` that always hits a driver OOM after a few days. Here is part of a driver heap dump taken before the OOM:
      ```
       num     #instances         #bytes  class name
      ----------------------------------------------
         1:      13845916      553836640  org.apache.spark.storage.BlockStatus
         2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
         3:      13883881      333213144  scala.collection.mutable.DefaultEntry
         4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
         5:         62360       65107352  [B
         6:        163368       24453904  [Ljava.lang.Object;
         7:        293651       20342664  [C
      ...
      ```
      `BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs.
      After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` never seems to remove blocks from `StorageStatus`.
      To fix the issue, I use `onBlockUpdated` to replace `onTaskEnd`, so we can update block information (add blocks, drop blocks from memory to disk, and delete blocks) in a timely manner.
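
      A hedged sketch of the listener-side idea (field names are assumed from the public listener API; this is not the patched `StorageStatusListener` itself):

      ```scala
      import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

      // React to individual block updates as they happen, instead of replaying all block
      // statuses only at task end; dropped or deleted blocks can then be removed from
      // bookkeeping promptly rather than accumulating on the driver.
      class BlockTrackingListener extends SparkListener {
        override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
          val info = event.blockUpdatedInfo
          println(s"block ${info.blockId} on executor ${info.blockManagerId.executorId}: " +
            s"mem=${info.memSize} disk=${info.diskSize} level=${info.storageLevel}")
        }
      }
      ```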
      
      ## How was this patch tested?
      
      Existing unit tests and manual tests
      
      Author: jeanlyn <jeanlyn92@gmail.com>
      
      Closes #11779 from jeanlyn/fix_driver_oom.
      ad9e3d50
    • Andrew Or's avatar
      [SPARK-14119][SPARK-14120][SPARK-14122][SQL] Throw exception on unsupported DDL commands · a916d2a4
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Before: We just pass all role commands to Hive even though they don't work.
      After: We throw an `AnalysisException` that looks like this:
      
      ```
      scala> sql("CREATE ROLE x")
      org.apache.spark.sql.AnalysisException: Unsupported Hive operation: CREATE ROLE;
        at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:213)
        at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:208)
        at org.apache.spark.sql.catalyst.parser.CatalystQl.safeParse(CatalystQl.scala:49)
        at org.apache.spark.sql.hive.HiveQl.parsePlan(HiveQl.scala:208)
        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
      ```
      
      ## How was this patch tested?
      
      `HiveQuerySuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11948 from andrewor14/ddl-role-management.
      a916d2a4
    • Andrew Or's avatar
      [SPARK-14013][SQL] Proper temp function support in catalog · 27aab806
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Session catalog was added in #11750. However, it doesn't really support temporary functions properly; right now we only store the metadata in the form of `CatalogFunction`, but this doesn't make sense for temporary functions because there is no class name.
      
      This patch moves the `FunctionRegistry` into the `SessionCatalog`. With this, the user can call `catalog.createTempFunction` and `catalog.lookupFunction` to use the function they registered previously. This is currently still dead code, however.
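
      An illustrative, simplified sketch of the idea (not the actual SessionCatalog or FunctionRegistry API; the names here are hypothetical):

      ```scala
      import scala.collection.mutable

      // Temporary functions live in an in-memory, name-keyed registry rather than as
      // persisted CatalogFunction metadata, since they have no class name to store.
      class TempFunctionRegistry[F] {
        private val functions = mutable.Map.empty[String, F]

        def createTempFunction(name: String, f: F, overrideIfExists: Boolean = false): Unit = {
          val key = name.toLowerCase
          if (functions.contains(key) && !overrideIfExists) {
            throw new IllegalArgumentException(s"Temporary function '$name' already exists")
          }
          functions(key) = f
        }

        def lookupFunction(name: String): F =
          functions.getOrElse(name.toLowerCase,
            throw new NoSuchElementException(s"Undefined function: '$name'"))
      }
      ```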
      
      ## How was this patch tested?
      
      `SessionCatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11972 from andrewor14/temp-functions.
      27aab806
    • Shixiong Zhu's avatar
      [SPARK-14169][CORE] Add UninterruptibleThread · 2f98ee67
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11971 from zsxwing/uninterrupt.
      2f98ee67
    • Reynold Xin's avatar
      [SPARK-14155][SQL] Hide UserDefinedType interface in Spark 2.0 · b7836492
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      UserDefinedType is a developer API in Spark 1.x. With very high probability, we will create a new user-defined type API that works well with column batches as well as encoders (Datasets). In Spark 2.0, let's make `UserDefinedType` `private[spark]` first.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11955 from rxin/SPARK-14155.
      b7836492
    • Andrew Or's avatar
      [SPARK-13923][SPARK-14014][SQL] Session catalog follow-ups · eebc8c1c
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This patch addresses the remaining comments left in #11750 and #11918 after they are merged. For a full list of changes in this patch, just trace the commits.
      
      ## How was this patch tested?
      
      `SessionCatalogSuite` and `CatalogTestCases`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12006 from andrewor14/session-catalog-followup.
      eebc8c1c
    • Shixiong Zhu's avatar
      [SPARK-14180][CORE] Fix a deadlock in CoarseGrainedExecutorBackend Shutdown · 34c0638e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Call `executor.stop` in a new thread to eliminate deadlock.
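
      A minimal sketch of the pattern (a hypothetical helper, not the CoarseGrainedExecutorBackend code):

      ```scala
      // Run a potentially blocking stop() on its own thread so that the calling RPC/event-loop
      // thread is not blocked waiting on it, avoiding the deadlock described above.
      def stopAsync(stop: () => Unit): Unit = {
        val t = new Thread("stop-executor-thread") {
          override def run(): Unit = stop()
        }
        t.setDaemon(true)
        t.start()
      }
      ```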
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12012 from zsxwing/SPARK-14180.
      34c0638e
    • Herman van Hovell's avatar
      [SPARK-14086][SQL] Add DDL commands to ANTLR4 parser · 328c7116
      Herman van Hovell authored
      #### What changes were proposed in this pull request?
      
      This PR adds all the current Spark SQL DDL commands to the new ANTLR 4 based SQL parser.
      
      I have found a few inconsistencies in the current commands:
      - Function has an alias field. This is actually the class name of the function.
      - Partition specifications should contain nulls in some commands, and contain `None`s in others.
      - `AlterTableSkewedLocation`: Should define which columns have skewed values, and should allow us to define storage for each skewed combination of values. We currently only allow one value per field.
      - `AlterTableSetFileFormat`: Should only have one file format, it currently supports both.
      
      I have implemented all of these as they currently are, and I propose to improve them in follow-up PRs.
      
      #### How was this patch tested?
      
      The existing DDLCommandSuite.
      
      cc rxin andrewor14 yhuai
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12011 from hvanhovell/SPARK-14086.
      328c7116
    • Xusen Yin's avatar
      [SPARK-11893] Model export/import for spark.ml: TrainValidationSplit · 8c11d1aa
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-11893
      
      jkbradley In order to share read/write with `TrainValidationSplit`, I moved the `SharedReadWrite` out of `CrossValidator` into a new trait `SharedReadWrite` in the tuning package.

      To reduce repeated tests, I moved the complex tests from `CrossValidatorSuite` to `SharedReadWriteSuite`, and created a fake validator called `MyValidator` to test the shared code.

      With `SharedReadWrite`, any newly added `Validator` can share the common read/write code and only needs to implement save/load for its extra params.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #9971 from yinxusen/SPARK-11893.
      8c11d1aa
    • zero323's avatar
      [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… · 39f743a6
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR replaces list comprehension in python_full_outer_join.dispatch with a generator expression.
      
      ## How was this patch tested?
      
      PySpark-Core, PySpark-MLlib test suites against Python 2.7, 3.5.
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #11998 from zero323/pyspark-join-generator-expr.
      39f743a6
    • nfraison's avatar
      [SPARK-13622][YARN] Issue creating level db for YARN shuffle service · ff3bea38
      nfraison authored
      ## What changes were proposed in this pull request?
      This patch ensures that we trim all paths set in yarn.nodemanager.local-dirs and that the scheme is properly removed so the LevelDB can be created.
      
      ## How was this patch tested?
      manual tests.
      
      Author: nfraison <nfraison@yahoo.fr>
      
      Closes #11475 from ashangit/level_db_creation_issue.
      ff3bea38
    • Yin Huai's avatar
      [SPARK-13713][SQL][TEST-MAVEN] Add Antlr4 maven plugin. · 7007f72b
      Yin Huai authored
      It seems https://github.com/apache/spark/commit/600c0b69cab4767e8e5a6f4284777d8b9d4bd40e is missing the ANTLR4 Maven plugin. This PR adds it.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12010 from yhuai/mavenAntlr4.
      7007f72b
    • Davies Liu's avatar
      [SPARK-14052] [SQL] build a BytesToBytesMap directly in HashedRelation · d7b58f14
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, for keys that cannot fit within a long, we build a hash map for UnsafeHashedRelation, which is converted to a BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly for better memory efficiency.

      In order to do that, BytesToBytesMap should support multiple (K, V) pairs with the same K. Location.putNewKey() is renamed to Location.append(), which can append multiple values for the same key (same Location). `Location.newValue()` is added to find the next value for the same key.
      
      ## How was this patch tested?
      
      Existing tests. Added benchmark for broadcast hash join with duplicated keys.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11870 from davies/map2.
      d7b58f14
    • Herman van Hovell's avatar
      [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 · 600c0b69
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
      
      This parser is based on [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality are currently missing; the plan is to add these in follow-up PRs.
      
      This PR is a work in progress, and work needs to be done in the following areas:
      
      - [x] Error handling should be improved.
      - [x] Documentation should be improved.
      - [x] Multi-Insert needs to be tested.
      - [ ] Naming and package locations.
      
      ### How was this patch tested?
      
      Catalyst and SQL unit tests.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #11557 from hvanhovell/ngParser.
      600c0b69
    • Liang-Chi Hsieh's avatar
      [SPARK-14156][SQL] Use executedPlan in HiveComparisonTest for the messages of computed tables · 1528ff4c
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      JIRA: https://issues.apache.org/jira/browse/SPARK-14156
      
      In HiveComparisonTest, when Catalyst results differ from Hive results, we collect messages for the computed tables during the test. When creating the message, we use sparkPlan, but we actually run the query with executedPlan. So the error message is sometimes confusing.
      
      For example, since wholestage codegen is enabled by default now, the Spark plan shown for computed tables is the plan before wholestage codegen.
      
      A concrete example is the following error message, shown before this patch. It is the error shown when running `HiveCompatibilityTest` `auto_join26`.
      
      auto_join26 has one SQL to create table:
      
          INSERT OVERWRITE TABLE dest_j1
          SELECT  x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) group by x.key;   (1)
      
      Then a SQL to retrieve the result:
      
          select * from dest_j1 x order by x.key;   (2)
      
      When the above SQL (2) to retrieve the result fails, `HiveComparisonTest` will try to collect and show the generated data from table `dest_j1` using SQL (1)'s spark plan. Then you will see this error:
      
          TungstenAggregate(key=[key#8804], functions=[(count(1),mode=Partial,isDistinct=false)], output=[key#8804,count#8834L])
          +- Project [key#8804]
             +- BroadcastHashJoin [key#8804], [key#8806], Inner, BuildRight, None
                :- Filter isnotnull(key#8804)
                :  +- InMemoryColumnarTableScan [key#8804], [isnotnull(key#8804)], InMemoryRelation [key#8804,value#8805], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8717,value#8718], MetastoreRelation default, src1, None, Some(src1)
                +- Filter isnotnull(key#8806)
                   +- InMemoryColumnarTableScan [key#8806], [isnotnull(key#8806)], InMemoryRelation [key#8806,value#8807], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8760,value#8761], MetastoreRelation default, src, None, Some(src)
      
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:82)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:140)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:137)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:87)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:82)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
      	... 70 more
          Caused by: java.lang.UnsupportedOperationException: Filter does not implement doExecuteBroadcast
      	at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:221)
      
      The message is confusing because it is not the plan actually run by the Spark SQL engine to create the generated table; the plan that was actually run is fine. But before this patch, we ran `e.sparkPlan.collect` to retrieve and show the generated data, and that spark plan is not a plan we can run, so the above error was shown.
      
      After this patch, we won't see the error, because the executed plan is fine and works.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #11957 from viirya/use-executedplan.
      1528ff4c
    • Kazuaki Ishizaki's avatar
      [SPARK-13844] [SQL] Generate better code for filters with a non-nullable column · 4a7636f2
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR simplifies generated code with a non-nullable column. This PR addresses three items:
      1. Generate simplified code for and / or
      2. Generate better code for divide and remainder with non-zero dividend
      3. Pass nullable information into BoundReference at WholeStageCodegen
      
      I have attached the generated code with and without this PR
      
      ## How was this patch tested?
      
      Tested by existing test suites in sql/core
      
      Here is a motivating example
      ````
      (0 to 6).map(i => (i.toString, i.toInt)).toDF("k", "v")
        .filter("v % 2 == 0").filter("v <= 4").filter("v > 1").show()
      ````
      
      Generated code without this PR
      ````java
      /* 032 */   protected void processNext() throws java.io.IOException {
      /* 033 */     /*** PRODUCE: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 034 */
      /* 035 */     /*** PRODUCE: Filter ((isnotnull((_2#1 % 2)) && ((_2#1 % 2) = 0)) && ((_2#1 <= 4) && (_2#1 > 1))) */
      /* 036 */
      /* 037 */     /*** PRODUCE: INPUT */
      /* 038 */
      /* 039 */     while (!shouldStop() && inputadapter_input.hasNext()) {
      /* 040 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 041 */       /*** CONSUME: Filter ((isnotnull((_2#1 % 2)) && ((_2#1 % 2) = 0)) && ((_2#1 <= 4) && (_2#1 > 1))) */
      /* 042 */       /* input[1, int] */
      /* 043 */       int filter_value1 = inputadapter_row.getInt(1);
      /* 044 */
      /* 045 */       /* isnotnull((input[1, int] % 2)) */
      /* 046 */       /* (input[1, int] % 2) */
      /* 047 */       boolean filter_isNull3 = false;
      /* 048 */       int filter_value3 = -1;
      /* 049 */       if (false || 2 == 0) {
      /* 050 */         filter_isNull3 = true;
      /* 051 */       } else {
      /* 052 */         if (false) {
      /* 053 */           filter_isNull3 = true;
      /* 054 */         } else {
      /* 055 */           filter_value3 = (int)(filter_value1 % 2);
      /* 056 */         }
      /* 057 */       }
      /* 058 */       if (!(!(filter_isNull3))) continue;
      /* 059 */
      /* 060 */       /* ((input[1, int] % 2) = 0) */
      /* 061 */       boolean filter_isNull6 = true;
      /* 062 */       boolean filter_value6 = false;
      /* 063 */       /* (input[1, int] % 2) */
      /* 064 */       boolean filter_isNull7 = false;
      /* 065 */       int filter_value7 = -1;
      /* 066 */       if (false || 2 == 0) {
      /* 067 */         filter_isNull7 = true;
      /* 068 */       } else {
      /* 069 */         if (false) {
      /* 070 */           filter_isNull7 = true;
      /* 071 */         } else {
      /* 072 */           filter_value7 = (int)(filter_value1 % 2);
      /* 073 */         }
      /* 074 */       }
      /* 075 */       if (!filter_isNull7) {
      /* 076 */         filter_isNull6 = false; // resultCode could change nullability.
      /* 077 */         filter_value6 = filter_value7 == 0;
      /* 078 */
      /* 079 */       }
      /* 080 */       if (filter_isNull6 || !filter_value6) continue;
      /* 081 */
      /* 082 */       /* (input[1, int] <= 4) */
      /* 083 */       boolean filter_value11 = false;
      /* 084 */       filter_value11 = filter_value1 <= 4;
      /* 085 */       if (!filter_value11) continue;
      /* 086 */
      /* 087 */       /* (input[1, int] > 1) */
      /* 088 */       boolean filter_value14 = false;
      /* 089 */       filter_value14 = filter_value1 > 1;
      /* 090 */       if (!filter_value14) continue;
      /* 091 */
      /* 092 */       filter_metricValue.add(1);
      /* 093 */
      /* 094 */       /*** CONSUME: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 095 */
      /* 096 */       /* input[0, string] */
      /* 097 */       /* input[0, string] */
      /* 098 */       boolean filter_isNull = inputadapter_row.isNullAt(0);
      /* 099 */       UTF8String filter_value = filter_isNull ? null : (inputadapter_row.getUTF8String(0));
      /* 100 */       project_holder.reset();
      /* 101 */
      /* 102 */       project_rowWriter.zeroOutNullBytes();
      /* 103 */
      /* 104 */       if (filter_isNull) {
      /* 105 */         project_rowWriter.setNullAt(0);
      /* 106 */       } else {
      /* 107 */         project_rowWriter.write(0, filter_value);
      /* 108 */       }
      /* 109 */
      /* 110 */       project_rowWriter.write(1, filter_value1);
      /* 111 */       project_result.setTotalSize(project_holder.totalSize());
      /* 112 */       append(project_result.copy());
      /* 113 */     }
      /* 114 */   }
      /* 115 */ }
      ````
      
      Generated code with this PR
      ````java
      /* 032 */   protected void processNext() throws java.io.IOException {
      /* 033 */     /*** PRODUCE: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 034 */
      /* 035 */     /*** PRODUCE: Filter (((_2#1 % 2) = 0) && ((_2#1 <= 5) && (_2#1 > 1))) */
      /* 036 */
      /* 037 */     /*** PRODUCE: INPUT */
      /* 038 */
      /* 039 */     while (!shouldStop() && inputadapter_input.hasNext()) {
      /* 040 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 041 */       /*** CONSUME: Filter (((_2#1 % 2) = 0) && ((_2#1 <= 5) && (_2#1 > 1))) */
      /* 042 */       /* input[1, int] */
      /* 043 */       int filter_value1 = inputadapter_row.getInt(1);
      /* 044 */
      /* 045 */       /* ((input[1, int] % 2) = 0) */
      /* 046 */       /* (input[1, int] % 2) */
      /* 047 */       int filter_value3 = (int)(filter_value1 % 2);
      /* 048 */
      /* 049 */       boolean filter_value2 = false;
      /* 050 */       filter_value2 = filter_value3 == 0;
      /* 051 */       if (!filter_value2) continue;
      /* 052 */
      /* 053 */       /* (input[1, int] <= 5) */
      /* 054 */       boolean filter_value7 = false;
      /* 055 */       filter_value7 = filter_value1 <= 5;
      /* 056 */       if (!filter_value7) continue;
      /* 057 */
      /* 058 */       /* (input[1, int] > 1) */
      /* 059 */       boolean filter_value10 = false;
      /* 060 */       filter_value10 = filter_value1 > 1;
      /* 061 */       if (!filter_value10) continue;
      /* 062 */
      /* 063 */       filter_metricValue.add(1);
      /* 064 */
      /* 065 */       /*** CONSUME: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 066 */
      /* 067 */       /* input[0, string] */
      /* 068 */       /* input[0, string] */
      /* 069 */       boolean filter_isNull = inputadapter_row.isNullAt(0);
      /* 070 */       UTF8String filter_value = filter_isNull ? null : (inputadapter_row.getUTF8String(0));
      /* 071 */       project_holder.reset();
      /* 072 */
      /* 073 */       project_rowWriter.zeroOutNullBytes();
      /* 074 */
      /* 075 */       if (filter_isNull) {
      /* 076 */         project_rowWriter.setNullAt(0);
      /* 077 */       } else {
      /* 078 */         project_rowWriter.write(0, filter_value);
      /* 079 */       }
      /* 080 */
      /* 081 */       project_rowWriter.write(1, filter_value1);
      /* 082 */       project_result.setTotalSize(project_holder.totalSize());
      /* 083 */       append(project_result.copy());
      /* 084 */     }
      /* 085 */   }
      /* 086 */ }
      ````
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #11684 from kiszk/SPARK-13844.
      4a7636f2
    • Davies Liu's avatar
      Revert "[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF." · e5a1b301
      Davies Liu authored
      This reverts commit 40984f67.
      e5a1b301
    • Sun Rui's avatar
      [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF. · 40984f67
      Sun Rui authored
      Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
      
      Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API from SparkR later.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10947 from sun-rui/SPARK-12792.
      40984f67
    • Liang-Chi Hsieh's avatar
      [SPARK-13742] [CORE] Add non-iterator interface to RandomSampler · 68c0c460
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13742
      
      ## What changes were proposed in this pull request?
      
      `RandomSampler.sample` currently accepts an iterator as input and outputs another iterator. This makes it inappropriate to use in the wholestage codegen of the `Sampler` operator (#11517). This change adds a non-iterator interface to `RandomSampler`.
      
      This change adds a new method `def sample(): Int` to the trait `RandomSampler`. Since we don't need to know the actual values of the sampled items, the new method takes no arguments.
      
      This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled.
      
      For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling times can only be 0 or 1. It simply means whether to sample the next item or not.
      
      For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times.
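
      An illustrative sketch of the interface shape described above (not the Spark trait itself):

      ```scala
      import scala.util.Random

      // sample() reports how many times the *next* item should be emitted,
      // without needing to see the item's value.
      trait ItemSampler {
        def sample(): Int
      }

      // Bernoulli-style: emit the next item once or not at all.
      class BernoulliLikeSampler(fraction: Double, rng: Random = new Random()) extends ItemSampler {
        override def sample(): Int = if (rng.nextDouble() < fraction) 1 else 0
      }

      // A Poisson-style sampler (sampling with replacement) may emit the next item
      // multiple times, so its sample() can return values greater than 1.
      ```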
      
      ## How was this patch tested?
      
      Tests are added into `RandomSamplerSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11578 from viirya/random-sampler-no-iterator.
      68c0c460
    • Chenliang Xu's avatar
      [SPARK-14187][MLLIB] Fix incorrect use of binarySearch in SparseMatrix · c8388297
      Chenliang Xu authored
      ## What changes were proposed in this pull request?
      
      Fix incorrect use of binarySearch in SparseMatrix
      
      ## How was this patch tested?
      
      Unit test added.
      
      Author: Chenliang Xu <chexu@groupon.com>
      
      Closes #11992 from luckyrandom/SPARK-14187.
      c8388297
    • Dongjoon Hyun's avatar
      [SPARK-14102][CORE] Block `reset` command in SparkShell · b66aa900
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Spark Shell provides an easy way to use Spark in a Scala environment. This PR adds the `reset` command to a blocked list, and also cleans up the code according to the Scala coding style.
      ```scala
      scala> sc
      res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@718fad24
      scala> :reset
      scala> sc
      <console>:11: error: not found: value sc
             sc
             ^
      ```
      If we block `reset`, Spark Shell works as follows.
      ```scala
      scala> :reset
      reset: no such command.  Type :help for help.
      scala> :re
      re is ambiguous: did you mean :replay or :require?
      ```
      
      ## How was this patch tested?
      
      Manual. Run `bin/spark-shell` and type `:reset`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11920 from dongjoon-hyun/SPARK-14102.
      b66aa900
    • Sean Owen's avatar
      [SPARK-12494][MLLIB] Array out of bound Exception in KMeans Yarn Mode · 7b841540
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Better error message when k-means init can't take enough samples from the input (perhaps because it is empty).
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11979 from srowen/SPARK-12494.
      7b841540
    • Kousuke Saruta's avatar
      [SPARK-14185][SQL][MINOR] Make indentation of debug log for generated code proper · aac13fb4
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      The indentation of the debug log output by `CodeGenerator` is weird.
      The first line of the generated code should be put on the line after the first line of the log message.
      
      ```
      16/03/28 11:10:24 DEBUG CodeGenerator: /* 001 */
      /* 002 */ public java.lang.Object generate(Object[] references) {
      /* 003 */   return new SpecificSafeProjection(references);
      ...
      ```
      
      After this patch is applied, we get debug logs like the following.
      
      ```
      16/03/28 10:45:50 DEBUG CodeGenerator:
      /* 001 */
      /* 002 */ public java.lang.Object generate(Object[] references) {
      /* 003 */   return new SpecificSafeProjection(references);
      ...
      ```
      ## How was this patch tested?
      
      Ran some jobs and checked debug logs.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #11990 from sarutak/fix-debuglog-indentation.
      aac13fb4
  3. Mar 27, 2016
    • Joseph K. Bradley's avatar
      [SPARK-10691][ML] Make LogisticRegressionModel, LinearRegressionModel evaluate() public · 8ef49376
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made the evaluate method public. Fixed LogisticRegressionModel.evaluate to handle the case when probabilityCol is not specified.
      
      ## How was this patch tested?
      
      There were already unit tests for these methods.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11928 from jkbradley/public-evaluate.
      8ef49376
    • Dongjoon Hyun's avatar
      [MINOR][MLLIB] Remove TODO comment in DecisionTreeModel.scala · 0f02a5c6
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). Since [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) has been fixed. Now we had better remove the comment without changing the persistence code.
      
      ```scala
      -        categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed
      +        categories: Seq[Double]) {
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11966 from dongjoon-hyun/change_categories_type.
      0f02a5c6
    • Dongjoon Hyun's avatar
      [MINOR][SQL] Fix substr/substring testcases. · cfcca732
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following two test cases so that they test the correct usages.
      ```
      checkSqlGeneration("SELECT substr('This is a test', 'is')")
      checkSqlGeneration("SELECT substring('This is a test', 'is')")
      ```
      
      Actually, the test cases work, but they exercise exceptional cases.
      ```
      scala> sql("SELECT substr('This is a test', 'is')")
      res0: org.apache.spark.sql.DataFrame = [substring(This is a test, CAST(is AS INT), 2147483647): string]
      
      scala> sql("SELECT substr('This is a test', 'is')").collect()
      res1: Array[org.apache.spark.sql.Row] = Array([null])
      ```
      
      ## How was this patch tested?
      
      Pass the modified unit tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11963 from dongjoon-hyun/fix_substr_testcase.
      cfcca732