  1. Mar 28, 2016
    • [SPARK-14205][SQL] remove trait Queryable · 38326cad
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      After DataFrame and Dataset are merged, the trait `Queryable` becomes unnecessary as it has only one implementation. We should remove it.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12001 from cloud-fan/df-ds.
    • [SPARK-14219][GRAPHX] Fix `pickRandomVertex` not to fall into infinite loops... · 289257c4
      Dongjoon Hyun authored
      [SPARK-14219][GRAPHX] Fix `pickRandomVertex` not to fall into infinite loops for graphs with one vertex
      
      ## What changes were proposed in this pull request?
      
      Currently, `GraphOps.pickRandomVertex()` falls into infinite loops for graphs having only one vertex. This PR fixes it by modifying the following termination-checking condition.
      ```scala
      -      if (selectedVertices.count > 1) {
      +      if (selectedVertices.count > 0) {
      ```
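
      For context, a simplified sketch of the sampling loop (illustrative only, not the actual GraphX implementation) shows why a one-vertex graph could never satisfy the old `> 1` check and therefore looped forever:

      ```scala
      import scala.util.Random

      import org.apache.spark.graphx.{Graph, VertexId}

      // Simplified sketch: keep sampling until at least one vertex is selected. With a
      // one-vertex graph, the old `count > 1` check could never hold, so the loop never ended.
      def pickRandomVertexSketch[VD, ED](graph: Graph[VD, ED], probability: Double): VertexId = {
        var chosen: Option[VertexId] = None
        while (chosen.isEmpty) {
          val selectedVertices = graph.vertices.filter { case (_, _) => Random.nextDouble() < probability }
          if (selectedVertices.count > 0) {   // was `> 1` before this fix
            chosen = Some(selectedVertices.first()._1)
          }
        }
        chosen.get
      }
      ```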
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new test case).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12018 from dongjoon-hyun/SPARK-14219.
    • [SPARK-13447][YARN][CORE] Clean the stale states for AM failure and restart situation · 2bc7c96d
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up fix to #9963. In #9963 we handled this stale-state clean-up only for the dynamic-allocation-enabled scenario; here we should also clean the states in `CoarseGrainedSchedulerBackend` for the dynamic-allocation-disabled scenario.
      
      Please review, CC andrewor14 lianhuiwang , thanks a lot.
      
      ## How was this patch tested?
      
      Ran the unit tests locally, and also verified manually with an integration test.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11366 from jerryshao/SPARK-13447.
    • [SPARK-13845][CORE] Using onBlockUpdated to replace onTaskEnd avoiding driver OOM · ad9e3d50
      jeanlyn authored
      ## What changes were proposed in this pull request?
      
      We have a streaming job using `FlumePollInputStream` whose driver always OOMs after a few days. Here is a driver heap dump taken before the OOM:
      ```
       num     #instances         #bytes  class name
      ----------------------------------------------
         1:      13845916      553836640  org.apache.spark.storage.BlockStatus
         2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
         3:      13883881      333213144  scala.collection.mutable.DefaultEntry
         4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
         5:         62360       65107352  [B
         6:        163368       24453904  [Ljava.lang.Object;
         7:        293651       20342664  [C
      ...
      ```
      `BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs.
      After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` never seems to remove blocks from `StorageStatus`.
      To fix the issue, I use `onBlockUpdated` instead of `onTaskEnd`, so we can update the block information (adding blocks, dropping a block from memory to disk, and deleting blocks) in time.
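
      A minimal sketch of the listener-side change, assuming simplified signatures (this is not the actual `StorageStatusListener` code):

      ```scala
      // Sketch only: react to SparkListenerBlockUpdated instead of onTaskEnd. Field and
      // method names mirror the description above, but the exact signatures are assumptions.
      override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
        val info = event.blockUpdatedInfo
        executorIdToStorageStatus.get(info.blockManagerId.executorId).foreach { status =>
          if (info.storageLevel.isValid) {
            // block added, or dropped from memory to disk: record its latest status
            status.updateBlock(info.blockId, BlockStatus(info.storageLevel, info.memSize, info.diskSize))
          } else {
            // block removed: delete the entry so BlockStatus/StreamBlockId objects stop accumulating
            status.removeBlock(info.blockId)
          }
        }
      }
      ```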
      
      ## How was this patch tested?
      
      Existing unit tests and manual tests
      
      Author: jeanlyn <jeanlyn92@gmail.com>
      
      Closes #11779 from jeanlyn/fix_driver_oom.
    • [SPARK-14119][SPARK-14120][SPARK-14122][SQL] Throw exception on unsupported DDL commands · a916d2a4
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Before: We just pass all role commands to Hive even though they don't work.
      After: We throw an `AnalysisException` that looks like this:
      
      ```
      scala> sql("CREATE ROLE x")
      org.apache.spark.sql.AnalysisException: Unsupported Hive operation: CREATE ROLE;
        at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:213)
        at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:208)
        at org.apache.spark.sql.catalyst.parser.CatalystQl.safeParse(CatalystQl.scala:49)
        at org.apache.spark.sql.hive.HiveQl.parsePlan(HiveQl.scala:208)
        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
      ```
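
      The guard itself is conceptually small; a hedged sketch (the helper and command names below are assumptions, not the exact `HiveQl` code):

      ```scala
      import org.apache.spark.sql.AnalysisException

      // Sketch: reject role-management commands up front rather than forwarding them to Hive.
      def checkSupported(commandName: String): Unit = {
        val unsupported = Set("CREATE ROLE", "DROP ROLE", "GRANT ROLE", "REVOKE ROLE", "SHOW ROLES")
        if (unsupported.contains(commandName)) {
          throw new AnalysisException(s"Unsupported Hive operation: $commandName")
        }
      }
      ```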
      
      ## How was this patch tested?
      
      `HiveQuerySuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11948 from andrewor14/ddl-role-management.
    • [SPARK-14013][SQL] Proper temp function support in catalog · 27aab806
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Session catalog was added in #11750. However, it doesn't really support temporary functions properly; right now we only store the metadata in the form of `CatalogFunction`, but this doesn't make sense for temporary functions because there is no class name.
      
      This patch moves the `FunctionRegistry` into the `SessionCatalog`. With this, the user can call `catalog.createTempFunction` and `catalog.lookupFunction` to use the function they registered previously. This is currently still dead code, however.
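
      A rough usage sketch of those calls; only `createTempFunction` and `lookupFunction` come from the description, and the argument lists shown here are simplified assumptions:

      ```scala
      import org.apache.spark.sql.catalyst.expressions.{Expression, Length, Literal}

      // `catalog` is a SessionCatalog instance assumed to be in scope.
      // Register a temporary function as an expression builder, then resolve it by name.
      catalog.createTempFunction("strlen", (args: Seq[Expression]) => Length(args.head))
      val resolved: Expression = catalog.lookupFunction("strlen", Seq(Literal("spark")))
      ```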
      
      ## How was this patch tested?
      
      `SessionCatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11972 from andrewor14/temp-functions.
    • [SPARK-14169][CORE] Add UninterruptibleThread · 2f98ee67
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it.
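
      The core idea, sketched for illustration (this is not the actual `UninterruptibleThread` code): an interrupt that arrives while the thread is inside a critical section is deferred and delivered once the section ends, which is the behaviour needed to work around HADOOP-10622.

      ```scala
      // Sketch: a thread whose interrupt() can be postponed while runUninterruptibly is active.
      class UninterruptibleThreadSketch(body: => Unit) extends Thread {
        private val lock = new Object
        private var uninterruptible = false
        private var pendingInterrupt = false

        def runUninterruptibly[T](f: => T): T = {
          lock.synchronized { uninterruptible = true }
          try f finally lock.synchronized {
            uninterruptible = false
            if (pendingInterrupt) { pendingInterrupt = false; super.interrupt() } // deliver deferred interrupt
          }
        }

        override def interrupt(): Unit = lock.synchronized {
          if (uninterruptible) pendingInterrupt = true else super.interrupt()
        }

        override def run(): Unit = body
      }
      ```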
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11971 from zsxwing/uninterrupt.
    • [SPARK-14155][SQL] Hide UserDefinedType interface in Spark 2.0 · b7836492
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      UserDefinedType is a developer API in Spark 1.x. With very high probability we will create a new API for user-defined types that works well with column batches as well as encoders (Datasets). In Spark 2.0, let's make `UserDefinedType` `private[spark]` first.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11955 from rxin/SPARK-14155.
    • [SPARK-13923][SPARK-14014][SQL] Session catalog follow-ups · eebc8c1c
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This patch addresses the remaining comments left in #11750 and #11918 after they are merged. For a full list of changes in this patch, just trace the commits.
      
      ## How was this patch tested?
      
      `SessionCatalogSuite` and `CatalogTestCases`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12006 from andrewor14/session-catalog-followup.
    • [SPARK-14180][CORE] Fix a deadlock in CoarseGrainedExecutorBackend Shutdown · 34c0638e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Call `executor.stop` in a new thread to eliminate deadlock.
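
      A minimal sketch of the idea (the thread name and the surrounding `receive` context are assumptions): moving the blocking `executor.stop()` call off the RPC thread breaks the circular wait.

      ```scala
      // Sketch of the shape of the fix (surrounding match/receive context omitted): stop the
      // executor from a separate thread so the RPC event loop is never blocked on itself.
      case Shutdown =>
        new Thread("CoarseGrainedExecutorBackend-stop-executor") {
          override def run(): Unit = executor.stop()
        }.start()
      ```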
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12012 from zsxwing/SPARK-14180.
    • [SPARK-14086][SQL] Add DDL commands to ANTLR4 parser · 328c7116
      Herman van Hovell authored
      #### What changes were proposed in this pull request?
      
      This PR adds all the current Spark SQL DDL commands to the new ANTLR 4 based SQL parser.
      
      I have found a few inconsistencies in the current commands:
      - Function has an alias field. This is actually the class name of the function.
      - Partition specifications should contain nulls in some commands, and contain `None`s in others.
      - `AlterTableSkewedLocation`: should define which columns have skewed values, and should allow us to define storage for each skewed combination of values. We currently only allow one value per field.
      - `AlterTableSetFileFormat`: should only have one file format; it currently supports both.
      
      I have implemented these commands as they currently behave, and I propose to improve them in follow-up PRs.
      
      #### How was this patch tested?
      
      The existing DDLCommandSuite.
      
      cc rxin andrewor14 yhuai
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12011 from hvanhovell/SPARK-14086.
    • [SPARK-11893] Model export/import for spark.ml: TrainValidationSplit · 8c11d1aa
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-11893
      
      jkbradley In order to share read/write with `TrainValidationSplit`, I move the shared read/write logic out of `CrossValidator` into a new trait `SharedReadWrite` in the tuning package.
      
      To reduce the repeated tests, I move the complex tests from `CrossValidatorSuite` to `SharedReadWriteSuite`, and create a fake validator called `MyValidator` to test the shared code.
      
      With `SharedReadWrite`, any newly added `Validator` can share the common read/write logic and only needs to implement save/load for its extra params.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #9971 from yinxusen/SPARK-11893.
    • [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… · 39f743a6
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR replaces list comprehension in python_full_outer_join.dispatch with a generator expression.
      
      ## How was this patch tested?
      
      PySpark-Core, PySpark-MLlib test suites against Python 2.7, 3.5.
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #11998 from zero323/pyspark-join-generator-expr.
    • [SPARK-13622][YARN] Issue creating level db for YARN shuffle service · ff3bea38
      nfraison authored
      ## What changes were proposed in this pull request?
      This patch ensures that we trim every path set in `yarn.nodemanager.local-dirs` and that the scheme is removed, so that the LevelDB can be created.
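
      A minimal Scala sketch of the idea (the actual YARN shuffle service code is Java, and the helper below is illustrative): trim each configured directory and strip any URI scheme before the LevelDB path is built.

      ```scala
      import org.apache.hadoop.fs.Path

      // e.g. "file:///data/1/yarn , /data/2/yarn" -> Seq("/data/1/yarn", "/data/2/yarn")
      def cleanedLocalDirs(localDirsProp: String): Seq[String] =
        localDirsProp.split(",")
          .map(_.trim)                              // drop stray whitespace around each entry
          .filter(_.nonEmpty)
          .map(dir => new Path(dir).toUri.getPath)  // strip any URI scheme, keep the filesystem path
          .toSeq
      ```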
      
      ## How was this patch tested?
      manual tests.
      
      Author: nfraison <nfraison@yahoo.fr>
      
      Closes #11475 from ashangit/level_db_creation_issue.
    • [SPARK-13713][SQL][TEST-MAVEN] Add Antlr4 maven plugin. · 7007f72b
      Yin Huai authored
      It seems https://github.com/apache/spark/commit/600c0b69cab4767e8e5a6f4284777d8b9d4bd40e is missing the ANTLR4 Maven plugin. This PR adds it.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12010 from yhuai/mavenAntlr4.
    • [SPARK-14052] [SQL] build a BytesToBytesMap directly in HashedRelation · d7b58f14
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, for keys that cannot fit within a long, we build a hash map for UnsafeHashedRelation, which is converted to a BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly for better memory efficiency.
      
      In order to do that, BytesToBytesMap should support multiple (K, V) pairs with the same K. Location.putNewKey() is renamed to Location.append(), which can append multiple values for the same key (same Location). `Location.newValue()` is added to find the next value for the same key.
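
      A hedged sketch of how the new multi-value API is meant to be used (the method names come from the description above; everything else, including the key/value buffers and `process`, is a placeholder):

      ```scala
      // Append several values under the same key, then iterate over all of them.
      val loc = map.lookup(keyBase, keyOffset, keyLength)
      loc.append(keyBase, keyOffset, keyLength, value1Base, value1Offset, value1Length)
      loc.append(keyBase, keyOffset, keyLength, value2Base, value2Offset, value2Length)  // same key, second value

      var hasValue = loc.isDefined
      while (hasValue) {
        process(loc.getValueBase, loc.getValueOffset, loc.getValueLength)
        hasValue = loc.newValue()  // advance to the next value stored under this key
      }
      ```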
      
      ## How was this patch tested?
      
      Existing tests. Added benchmark for broadcast hash join with duplicated keys.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11870 from davies/map2.
    • [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 · 600c0b69
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
      
      This parser is based on [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality are currently missing; the plan is to add these in follow-up PRs.
      
      This PR is a work in progress, and work needs to be done in the following areas:
      
      - [x] Error handling should be improved.
      - [x] Documentation should be improved.
      - [x] Multi-Insert needs to be tested.
      - [ ] Naming and package locations.
      
      ### How was this patch tested?
      
      Catalyst and SQL unit tests.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #11557 from hvanhovell/ngParser.
    • [SPARK-14156][SQL] Use executedPlan in HiveComparisonTest for the messages of computed tables · 1528ff4c
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      JIRA: https://issues.apache.org/jira/browse/SPARK-14156
      
      In HiveComparisonTest, when Catalyst results differ from Hive results, we collect messages about the computed tables during the test. When creating those messages we use sparkPlan, but we actually run the query with executedPlan, so the error message is sometimes confusing.
      
      For example, since whole-stage codegen is enabled by default now, the spark plan shown for computed tables is the plan before whole-stage codegen.
      
      A concrete example is the following error message, shown before this patch. It is the error produced when running the `HiveCompatibilityTest` query `auto_join26`.
      
      auto_join26 has one SQL statement to create a table:
      
          INSERT OVERWRITE TABLE dest_j1
          SELECT  x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) group by x.key;   (1)
      
      Then a SQL to retrieve the result:
      
          select * from dest_j1 x order by x.key;   (2)
      
      When the above SQL (2) fails to retrieve the result, `HiveComparisonTest` will try to collect and show the data generated in table `dest_j1` using SQL (1)'s spark plan. Then you will see this error:
      
          TungstenAggregate(key=[key#8804], functions=[(count(1),mode=Partial,isDistinct=false)], output=[key#8804,count#8834L])
          +- Project [key#8804]
             +- BroadcastHashJoin [key#8804], [key#8806], Inner, BuildRight, None
                :- Filter isnotnull(key#8804)
                :  +- InMemoryColumnarTableScan [key#8804], [isnotnull(key#8804)], InMemoryRelation [key#8804,value#8805], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8717,value#8718], MetastoreRelation default, src1, None, Some(src1)
                +- Filter isnotnull(key#8806)
                   +- InMemoryColumnarTableScan [key#8806], [isnotnull(key#8806)], InMemoryRelation [key#8806,value#8807], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8760,value#8761], MetastoreRelation default, src, None, Some(src)
      
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:82)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:140)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:137)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:87)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:82)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
      	... 70 more
          Caused by: java.lang.UnsupportedOperationException: Filter does not implement doExecuteBroadcast
      	at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:221)
      
      The message is confusing because it is not the plan actually run by the Spark SQL engine to create the generated table; the plan actually run is fine. But before this patch, we ran `e.sparkPlan.collect` to retrieve and show the generated data, and the spark plan is not a plan we can run directly, so the above error was shown.
      
      After this patch, we won't see the error because the executed plan is the one that actually works.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #11957 from viirya/use-executedplan.
    • [SPARK-13844] [SQL] Generate better code for filters with a non-nullable column · 4a7636f2
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR simplifies generated code with a non-nullable column. This PR addresses three items:
      1. Generate simplified code for and / or
      2. Generate better code for divide and remainder with non-zero dividend
      3. Pass nullable information into BoundReference at WholeStageCodegen
      
      I have attached the generated code with and without this PR
      
      ## How was this patch tested?
      
      Tested by existing test suites in sql/core
      
      Here is a motivating example
      ````
      (0 to 6).map(i => (i.toString, i.toInt)).toDF("k", "v")
        .filter("v % 2 == 0").filter("v <= 4").filter("v > 1").show()
      ````
      
      Generated code without this PR
      ````java
      /* 032 */   protected void processNext() throws java.io.IOException {
      /* 033 */     /*** PRODUCE: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 034 */
      /* 035 */     /*** PRODUCE: Filter ((isnotnull((_2#1 % 2)) && ((_2#1 % 2) = 0)) && ((_2#1 <= 4) && (_2#1 > 1))) */
      /* 036 */
      /* 037 */     /*** PRODUCE: INPUT */
      /* 038 */
      /* 039 */     while (!shouldStop() && inputadapter_input.hasNext()) {
      /* 040 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 041 */       /*** CONSUME: Filter ((isnotnull((_2#1 % 2)) && ((_2#1 % 2) = 0)) && ((_2#1 <= 4) && (_2#1 > 1))) */
      /* 042 */       /* input[1, int] */
      /* 043 */       int filter_value1 = inputadapter_row.getInt(1);
      /* 044 */
      /* 045 */       /* isnotnull((input[1, int] % 2)) */
      /* 046 */       /* (input[1, int] % 2) */
      /* 047 */       boolean filter_isNull3 = false;
      /* 048 */       int filter_value3 = -1;
      /* 049 */       if (false || 2 == 0) {
      /* 050 */         filter_isNull3 = true;
      /* 051 */       } else {
      /* 052 */         if (false) {
      /* 053 */           filter_isNull3 = true;
      /* 054 */         } else {
      /* 055 */           filter_value3 = (int)(filter_value1 % 2);
      /* 056 */         }
      /* 057 */       }
      /* 058 */       if (!(!(filter_isNull3))) continue;
      /* 059 */
      /* 060 */       /* ((input[1, int] % 2) = 0) */
      /* 061 */       boolean filter_isNull6 = true;
      /* 062 */       boolean filter_value6 = false;
      /* 063 */       /* (input[1, int] % 2) */
      /* 064 */       boolean filter_isNull7 = false;
      /* 065 */       int filter_value7 = -1;
      /* 066 */       if (false || 2 == 0) {
      /* 067 */         filter_isNull7 = true;
      /* 068 */       } else {
      /* 069 */         if (false) {
      /* 070 */           filter_isNull7 = true;
      /* 071 */         } else {
      /* 072 */           filter_value7 = (int)(filter_value1 % 2);
      /* 073 */         }
      /* 074 */       }
      /* 075 */       if (!filter_isNull7) {
      /* 076 */         filter_isNull6 = false; // resultCode could change nullability.
      /* 077 */         filter_value6 = filter_value7 == 0;
      /* 078 */
      /* 079 */       }
      /* 080 */       if (filter_isNull6 || !filter_value6) continue;
      /* 081 */
      /* 082 */       /* (input[1, int] <= 4) */
      /* 083 */       boolean filter_value11 = false;
      /* 084 */       filter_value11 = filter_value1 <= 4;
      /* 085 */       if (!filter_value11) continue;
      /* 086 */
      /* 087 */       /* (input[1, int] > 1) */
      /* 088 */       boolean filter_value14 = false;
      /* 089 */       filter_value14 = filter_value1 > 1;
      /* 090 */       if (!filter_value14) continue;
      /* 091 */
      /* 092 */       filter_metricValue.add(1);
      /* 093 */
      /* 094 */       /*** CONSUME: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 095 */
      /* 096 */       /* input[0, string] */
      /* 097 */       /* input[0, string] */
      /* 098 */       boolean filter_isNull = inputadapter_row.isNullAt(0);
      /* 099 */       UTF8String filter_value = filter_isNull ? null : (inputadapter_row.getUTF8String(0));
      /* 100 */       project_holder.reset();
      /* 101 */
      /* 102 */       project_rowWriter.zeroOutNullBytes();
      /* 103 */
      /* 104 */       if (filter_isNull) {
      /* 105 */         project_rowWriter.setNullAt(0);
      /* 106 */       } else {
      /* 107 */         project_rowWriter.write(0, filter_value);
      /* 108 */       }
      /* 109 */
      /* 110 */       project_rowWriter.write(1, filter_value1);
      /* 111 */       project_result.setTotalSize(project_holder.totalSize());
      /* 112 */       append(project_result.copy());
      /* 113 */     }
      /* 114 */   }
      /* 115 */ }
      ````
      
      Generated code with this PR
      ````java
      /* 032 */   protected void processNext() throws java.io.IOException {
      /* 033 */     /*** PRODUCE: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 034 */
      /* 035 */     /*** PRODUCE: Filter (((_2#1 % 2) = 0) && ((_2#1 <= 5) && (_2#1 > 1))) */
      /* 036 */
      /* 037 */     /*** PRODUCE: INPUT */
      /* 038 */
      /* 039 */     while (!shouldStop() && inputadapter_input.hasNext()) {
      /* 040 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 041 */       /*** CONSUME: Filter (((_2#1 % 2) = 0) && ((_2#1 <= 5) && (_2#1 > 1))) */
      /* 042 */       /* input[1, int] */
      /* 043 */       int filter_value1 = inputadapter_row.getInt(1);
      /* 044 */
      /* 045 */       /* ((input[1, int] % 2) = 0) */
      /* 046 */       /* (input[1, int] % 2) */
      /* 047 */       int filter_value3 = (int)(filter_value1 % 2);
      /* 048 */
      /* 049 */       boolean filter_value2 = false;
      /* 050 */       filter_value2 = filter_value3 == 0;
      /* 051 */       if (!filter_value2) continue;
      /* 052 */
      /* 053 */       /* (input[1, int] <= 5) */
      /* 054 */       boolean filter_value7 = false;
      /* 055 */       filter_value7 = filter_value1 <= 5;
      /* 056 */       if (!filter_value7) continue;
      /* 057 */
      /* 058 */       /* (input[1, int] > 1) */
      /* 059 */       boolean filter_value10 = false;
      /* 060 */       filter_value10 = filter_value1 > 1;
      /* 061 */       if (!filter_value10) continue;
      /* 062 */
      /* 063 */       filter_metricValue.add(1);
      /* 064 */
      /* 065 */       /*** CONSUME: Project [_1#0 AS k#3,_2#1 AS v#4] */
      /* 066 */
      /* 067 */       /* input[0, string] */
      /* 068 */       /* input[0, string] */
      /* 069 */       boolean filter_isNull = inputadapter_row.isNullAt(0);
      /* 070 */       UTF8String filter_value = filter_isNull ? null : (inputadapter_row.getUTF8String(0));
      /* 071 */       project_holder.reset();
      /* 072 */
      /* 073 */       project_rowWriter.zeroOutNullBytes();
      /* 074 */
      /* 075 */       if (filter_isNull) {
      /* 076 */         project_rowWriter.setNullAt(0);
      /* 077 */       } else {
      /* 078 */         project_rowWriter.write(0, filter_value);
      /* 079 */       }
      /* 080 */
      /* 081 */       project_rowWriter.write(1, filter_value1);
      /* 082 */       project_result.setTotalSize(project_holder.totalSize());
      /* 083 */       append(project_result.copy());
      /* 084 */     }
      /* 085 */   }
      /* 086 */ }
      ````
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #11684 from kiszk/SPARK-13844.
    • Revert "[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF." · e5a1b301
      Davies Liu authored
      This reverts commit 40984f67.
    • [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF. · 40984f67
      Sun Rui authored
      Refactor RRDD by separating the common logic for interacting with the R worker into a new class RRunner, which can be used to evaluate R UDFs.
      
      Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10947 from sun-rui/SPARK-12792.
    • [SPARK-13742] [CORE] Add non-iterator interface to RandomSampler · 68c0c460
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13742
      
      ## What changes were proposed in this pull request?
      
      `RandomSampler.sample` currently accepts an iterator as input and outputs another iterator. This makes it inappropriate to use in whole-stage codegen of the `Sampler` operator (#11517). This change adds a non-iterator interface to `RandomSampler`.
      
      This change adds a new method `def sample(): Int` to the trait `RandomSampler`. As we don't need to know the actual values of the sampled items, this new method takes no arguments.
      
      This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled.
      
      For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling times can only be 0 or 1. It simply means whether to sample the next item or not.
      
      For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times.
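
      A hedged usage sketch of the new interface (the `sample(): Int` method is taken from the description; `input`, `emit`, and the sampler construction are placeholders):

      ```scala
      import org.apache.spark.util.random.PoissonSampler

      // `input` and `emit` stand for the caller's row stream and output.
      val sampler = new PoissonSampler[AnyRef](0.3)   // sampling fraction chosen for illustration
      while (input.hasNext) {
        val row = input.next()
        var times = sampler.sample()                  // 0 = skip this row; can exceed 1 for Poisson sampling
        while (times > 0) { emit(row); times -= 1 }
      }
      ```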
      
      ## How was this patch tested?
      
      Tests are added into `RandomSamplerSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11578 from viirya/random-sampler-no-iterator.
    • [SPARK-14187][MLLIB] Fix incorrect use of binarySearch in SparseMatrix · c8388297
      Chenliang Xu authored
      ## What changes were proposed in this pull request?
      
      Fix incorrect use of binarySearch in SparseMatrix
      
      ## How was this patch tested?
      
      Unit test added.
      
      Author: Chenliang Xu <chexu@groupon.com>
      
      Closes #11992 from luckyrandom/SPARK-14187.
    • [SPARK-14102][CORE] Block `reset` command in SparkShell · b66aa900
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Spark Shell provides an easy way to use Spark in a Scala environment. This PR adds the `reset` command to a blocked list and also cleans up the code according to the Scala coding style.
      ```scala
      scala> sc
      res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@718fad24
      scala> :reset
      scala> sc
      <console>:11: error: not found: value sc
             sc
             ^
      ```
      If we block `reset`, Spark Shell works as follows.
      ```scala
      scala> :reset
      reset: no such command.  Type :help for help.
      scala> :re
      re is ambiguous: did you mean :replay or :require?
      ```
      
      ## How was this patch tested?
      
      Manual. Run `bin/spark-shell` and type `:reset`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11920 from dongjoon-hyun/SPARK-14102.
    • [SPARK-12494][MLLIB] Array out of bound Exception in KMeans Yarn Mode · 7b841540
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Better error message when k-means init can't take enough samples from the input (perhaps because it is empty).
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11979 from srowen/SPARK-12494.
    • [SPARK-14185][SQL][MINOR] Make indentation of debug log for generated code proper · aac13fb4
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      The indentation of the debug log output by `CodeGenerator` is odd.
      The first line of the generated code should start on the line after the first line of the log message.
      
      ```
      16/03/28 11:10:24 DEBUG CodeGenerator: /* 001 */
      /* 002 */ public java.lang.Object generate(Object[] references) {
      /* 003 */   return new SpecificSafeProjection(references);
      ...
      ```
      
      After this patch is applied, we get debug log like as follows.
      
      ```
      16/03/28 10:45:50 DEBUG CodeGenerator:
      /* 001 */
      /* 002 */ public java.lang.Object generate(Object[] references) {
      /* 003 */   return new SpecificSafeProjection(references);
      ...
      ```
      ## How was this patch tested?
      
      Ran some jobs and checked debug logs.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #11990 from sarutak/fix-debuglog-indentation.
  2. Mar 27, 2016
    • [SPARK-10691][ML] Make LogisticRegressionModel, LinearRegressionModel evaluate() public · 8ef49376
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made the evaluate method public. Fixed LogisticRegressionModel.evaluate to handle the case when probabilityCol is not specified.
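
      A short usage sketch of what the change enables (the DataFrames are placeholders; `evaluate` returns the model's summary type in spark.ml):

      ```scala
      import org.apache.spark.ml.classification.LogisticRegression

      // trainingDF and testDF are placeholder DataFrames with "label"/"features" columns.
      val model = new LogisticRegression().fit(trainingDF)
      val summary = model.evaluate(testDF)   // now callable from user code, not just within spark.ml
      ```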
      
      ## How was this patch tested?
      
      There were already unit tests for these methods.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11928 from jkbradley/public-evaluate.
    • [MINOR][MLLIB] Remove TODO comment DecisionTreeModel.scala · 0f02a5c6
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). Since [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) has been fixed, so we had better remove the comment now without changing the persisted code.
      
      ```scala
      -        categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed
      +        categories: Seq[Double]) {
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11966 from dongjoon-hyun/change_categories_type.
    • [MINOR][SQL] Fix substr/substring testcases. · cfcca732
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following two test cases so that they test the intended usages.
      ```
      checkSqlGeneration("SELECT substr('This is a test', 'is')")
      checkSqlGeneration("SELECT substring('This is a test', 'is')")
      ```
      
      Actually, the test cases work, but they test exceptional cases.
      ```
      scala> sql("SELECT substr('This is a test', 'is')")
      res0: org.apache.spark.sql.DataFrame = [substring(This is a test, CAST(is AS INT), 2147483647): string]
      
      scala> sql("SELECT substr('This is a test', 'is')").collect()
      res1: Array[org.apache.spark.sql.Row] = Array([null])
      ```
      
      ## How was this patch tested?
      
      Pass the modified unit tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11963 from dongjoon-hyun/fix_substr_testcase.
  3. Mar 26, 2016
    • [SPARK-14177][SQL] Native Parsing for DDL Command "Describe Database" and "Alter Database" · a01b6a92
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      This PR is to provide native parsing support for two DDL commands:  ```Describe Database``` and ```Alter Database Set Properties```
      
      Based on the Hive DDL document:
      https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
      
      ##### 1. ALTER DATABASE
      **Syntax:**
      ```SQL
      ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
      ```
       - `ALTER DATABASE` is to add new (key, value) pairs into `DBPROPERTIES`
      
      ##### 2. DESCRIBE DATABASE
      **Syntax:**
      ```SQL
      DESCRIBE DATABASE [EXTENDED] db_name
      ```
       - `DESCRIBE DATABASE` shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When `extended` is true, it also shows the database's properties
      
      #### How was this patch tested?
      Added the related test cases to `DDLCommandSuite`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      This patch had conflicts when merged, resolved by
      Committer: Yin Huai <yhuai@databricks.com>
      
      Closes #11977 from gatorsmile/parseAlterDatabase.
    • [SPARK-14157][SQL] Parse Drop Function DDL command · bc925b73
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      JIRA: https://issues.apache.org/jira/browse/SPARK-14157
      
      We only parse the CREATE FUNCTION command. In order to support a native DROP FUNCTION command, we need to parse it too.
      
      From the Hive [manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction), the DROP FUNCTION command has the following syntax:
      
      ```SQL
      DROP [TEMPORARY] FUNCTION [IF EXISTS] function_name;
      ```
      
      ## How was this patch tested?
      
      Added test into `DDLCommandSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #11959 from viirya/parse-drop-func.
    • [SPARK-14116][SQL] Implements buildReader() for ORC data source · b547de8a
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR implements `FileFormat.buildReader()` for our ORC data source. It also fixed several minor styling issues related to `HadoopFsRelation` planning code path.
      
      Note that `OrcNewInputFormat` doesn't rely on `OrcNewSplit` for creating `OrcRecordReader`s; a plain `FileSplit` is just fine. That's why we can simply create the record reader with the help of `OrcNewInputFormat` and `FileSplit`.
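
      A sketch of that record-reader construction (simplified and partly assumed, not the exact `buildReader()` body; `file` stands for the file being read and `conf` for a Hadoop `Configuration`):

      ```scala
      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat
      import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
      import org.apache.hadoop.mapreduce.lib.input.FileSplit
      import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

      // Build a plain FileSplit for the file, then ask OrcNewInputFormat for a record reader.
      val split = new FileSplit(new Path(file.filePath), file.start, file.length, Array.empty[String])
      val attemptContext = new TaskAttemptContextImpl(conf,
        new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0))
      val reader = new OrcNewInputFormat().createRecordReader(split, attemptContext)
      reader.initialize(split, attemptContext)
      ```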
      
      ## How was this patch tested?
      
      Existing test cases should do the work
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11936 from liancheng/spark-14116-build-reader-for-orc.
    • [SPARK-14161][SQL] Native Parsing for DDL Command Drop Database · 8989d3a3
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
      
      The syntax of DDL command for Drop Database is
      ```SQL
      DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
      ```
       - If `IF EXISTS` is not specified, the default behavior is to issue a warning message if `database_name` doesn't exist
       - `RESTRICT` is the default behavior.
      
      This PR is to provide a native parsing support for `DROP DATABASE`.
      
      #### How was this patch tested?
      
      Added a test case to `DDLCommandSuite`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11962 from gatorsmile/parseDropDatabase.
    • [SPARK-14135] Add off-heap storage memory bookkeeping support to MemoryManager · 20c0bcd9
      Josh Rosen authored
      This patch extends Spark's `UnifiedMemoryManager` to add bookkeeping support for off-heap storage memory, a requirement for enabling off-heap caching (which will be done by #11805). The `MemoryManager`'s `storageMemoryPool` has been split into separate on- and off-heap pools, and the storage and unroll memory allocation methods have been updated to accept a `memoryMode` parameter to specify whether allocations should be performed on- or off-heap.
      
      In order to reduce the testing surface, the `StaticMemoryManager` does not support off-heap caching (we plan to eventually remove the `StaticMemoryManager`, so this isn't a significant limitation).
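
      A hedged sketch of the updated call shape described above (`memoryManager`, `blockId`, and `numBytes` are assumed to be in scope; the exact parameter list is an assumption):

      ```scala
      import org.apache.spark.memory.MemoryMode

      // Storage and unroll memory requests now say which pool they want to draw from.
      val granted = memoryManager.acquireStorageMemory(blockId, numBytes, MemoryMode.OFF_HEAP)
      if (!granted) {
        // not enough off-heap storage memory: the caller evicts blocks or falls back (illustrative)
      }
      ```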
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11942 from JoshRosen/off-heap-storage-memory-bookkeeping.
    • [SPARK-14175][SQL] whole stage codegen interface refactor · bd94ea4c
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      1. Merge consumeChild into consume().
      2. Always generate code for both input variables and UnsafeRow; a plan can use either of them.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11975 from davies/gen_refactor.
    • [SPARK-13973][PYSPARK] ipython notebook` is going away · a91784fb
      Rekha Joshi authored
      ## What changes were proposed in this pull request?
      https://issues.apache.org/jira/browse/SPARK-13973
      
      ## How was this patch tested?
      Pyspark
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: Joshi <rekhajoshm@gmail.com>
      
      Closes #11829 from rekhajoshm/SPARK-13973.
    • [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since 1.1,... · 62a85eb0
      Liwei Lin authored
      [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5
      
      ## What changes were proposed in this pull request?
      
      Removed methods that have been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.
      
      ## How was this patch tested?
      
      - manually checked that no code in Spark calls these methods any more
      - existing test suites
      
      Author: Liwei Lin <lwlin7@gmail.com>
      Author: proflin <proflin.me@gmail.com>
      
      Closes #11910 from lw-lin/remove-deprecates.
    • [MINOR] Fix newly added java-lint errors · 18084658
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes some newly added java-lint errors (unused imports, line length).
      
      ## How was this patch tested?
      
      Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11968 from dongjoon-hyun/SPARK-14167.
    • [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq,... · d23ad7c1
      Shixiong Zhu authored
      [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter
      
      ## What changes were proposed in this pull request?
      
      This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages
      
      Also remove mqtt_wordcount.py that I forgot to remove previously.
      
      ## How was this patch tested?
      
      Jenkins PR Build.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11824 from zsxwing/remove-doc.
  4. Mar 25, 2016
    • [SPARK-14109][SQL] Fix HDFSMetadataLog to fallback from FileContext to FileSystem API · 13945dd8
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many schemes for which there are FileSystem implementations. In those cases, rather than failing completely, we should fall back to the FileSystem-based implementation and log a warning that there may be file consistency issues if the log directory is modified concurrently.
      
      In addition I have also added more tests to increase the code coverage.
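
      The fallback logic, sketched under assumed helper names (`FileContextManager` and `FileSystemManager` are illustrative wrappers, not necessarily the real classes; the fallback shape is what the description specifies):

      ```scala
      import org.apache.hadoop.fs.UnsupportedFileSystemException

      val fileManager =
        try {
          new FileContextManager(metadataPath, hadoopConf)   // preferred: supports atomic rename
        } catch {
          case e: UnsupportedFileSystemException =>
            logWarning("Could not use FileContext API for the metadata log; falling back to " +
              "FileSystem. Concurrent modification of the log directory may cause inconsistencies.", e)
            new FileSystemManager(metadataPath, hadoopConf)  // best-effort fallback
        }
      ```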
      
      ## How was this patch tested?
      
      Unit test.
      Tested on cluster with custom file system.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #11925 from tdas/SPARK-14109.