  1. Nov 10, 2016
  2. Nov 09, 2016
    • [SPARK-18147][SQL] do not fail for very complex aggregator result type · 8c489a78
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      ~In `TypedAggregateExpression.evaluateExpression`, we may create `ReferenceToExpressions` with `CreateStruct`, and `CreateStruct` may generate too much code and split it into several methods. `ReferenceToExpressions` replaces each `BoundReference` in `CreateStruct` with a `LambdaVariable`, which can only be used as a local variable and therefore doesn't work if the generated code is split.~
      
      This was already fixed by #15693; this PR adds a regression test.
      
      ## How was this patch tested?
      
      new test in `DatasetAggregatorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15807 from cloud-fan/typed-agg.
      
      (cherry picked from commit 6021c95a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-17829][SQL] Stable format for offset log · b7d29256
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Currently we use Java serialization for the WAL that stores the offsets contained in each batch. This has two main issues:
      
      * It can break across Spark releases (though this is not the only thing preventing us from upgrading a running query).
      * It is unnecessarily opaque to the user.
      
      I propose that we require offsets to provide a user-readable serialization and use that instead. JSON is probably a good option.
      
      ## How was this patch tested?
      
      Tests were added for KafkaSourceOffset in [KafkaSourceOffsetSuite](external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala) and for LongOffset in [OffsetSuite](sql/core/src/test/scala/org/apache/spark/sql/streaming/OffsetSuite.scala)
      
      zsxwing marmbrus
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tyson Condie <tcondie@clash.local>
      
      Closes #15626 from tcondie/spark-8360.
      
      (cherry picked from commit 3f62e1b5)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-18370][SQL] Add table information to InsertIntoHadoopFsRelationCommand · 4424c901
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      `InsertIntoHadoopFsRelationCommand` does not keep track of whether it inserts into a table, or which table it inserts into. This can make debugging these statements problematic. This PR adds table information to `InsertIntoHadoopFsRelationCommand`. Explaining the SQL command `insert into prq select * from range(0, 100000)` now yields the following executed plan:
      ```
      == Physical Plan ==
      ExecutedCommand
         +- InsertIntoHadoopFsRelationCommand file:/dev/assembly/spark-warehouse/prq, ParquetFormat, <function1>, Map(serialization.format -> 1, path -> file:/dev/assembly/spark-warehouse/prq), Append, CatalogTable(
      	Table: `default`.`prq`
      	Owner: hvanhovell
      	Created: Wed Nov 09 17:42:30 CET 2016
      	Last Access: Thu Jan 01 01:00:00 CET 1970
      	Type: MANAGED
      	Schema: [StructField(id,LongType,true)]
      	Provider: parquet
      	Properties: [transient_lastDdlTime=1478709750]
      	Storage(Location: file:/dev/assembly/spark-warehouse/prq, InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, Serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1]))
               +- Project [id#7L]
                  +- Range (0, 100000, step=1, splits=None)
      ```
      
      ## How was this patch tested?
      Added extra checks to the `ParquetMetastoreSuite`
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15832 from hvanhovell/SPARK-18370.
      
      (cherry picked from commit d8b81f77)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18368][SQL] Fix regexp replace when serialized · 80f58510
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      This makes the result value both transient and lazy, so that if the RegExpReplace object is initialized then serialized, `result: StringBuffer` will be correctly initialized.
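      
      A minimal Java sketch of the transient-plus-lazy pattern described above, under stated assumptions: `LazyBufferReplacer` and its methods are hypothetical stand-ins rather than Spark's `RegExpReplace`, and since Java has no `lazy val`, the lazy part is emulated with a null check in an accessor.
      
      ```java
      import java.io.ByteArrayInputStream;
      import java.io.ByteArrayOutputStream;
      import java.io.IOException;
      import java.io.ObjectInputStream;
      import java.io.ObjectOutputStream;
      import java.io.Serializable;
      import java.io.UncheckedIOException;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;
      
      // Hypothetical stand-in for an expression that reuses a scratch buffer.
      final class LazyBufferReplacer implements Serializable {
          private static final long serialVersionUID = 1L;
      
          private final String regex;
          private final String replacement;
      
          // transient: excluded from the serialized form, so it is null after deserialization.
          private transient StringBuffer result;
      
          LazyBufferReplacer(String regex, String replacement) {
              this.regex = regex;
              this.replacement = replacement;
          }
      
          // Lazy accessor: creates the buffer on first use, including first use after deserialization.
          private StringBuffer result() {
              if (result == null) {
                  result = new StringBuffer();
              }
              return result;
          }
      
          String replaceAll(String input) {
              StringBuffer buf = result();
              buf.setLength(0); // reuse the same buffer across calls
              Matcher m = Pattern.compile(regex).matcher(input);
              while (m.find()) {
                  m.appendReplacement(buf, replacement);
              }
              m.appendTail(buf);
              return buf.toString();
          }
      
          // Serializes and deserializes a replacer, roughly what shipping it inside a task does.
          static LazyBufferReplacer roundTrip(LazyBufferReplacer r) {
              try {
                  ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                  try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                      out.writeObject(r);
                  }
                  try (ObjectInputStream in =
                           new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                      return (LazyBufferReplacer) in.readObject();
                  }
              } catch (IOException e) {
                  throw new UncheckedIOException(e);
              } catch (ClassNotFoundException e) {
                  throw new IllegalStateException(e);
              }
          }
      }
      ```
      
      Without the null check in `result()`, a copy that was used once and then serialized would hit a `NullPointerException` on its first call after deserialization, because the transient field is not restored.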
      
      ## How was this patch tested?
      
      * Verified that this patch fixed the query that found the bug.
      * Added a test case that fails without the fix.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #15834 from rdblue/SPARK-18368-fix-regexp-replace.
      
      (cherry picked from commit d4028de9)
      Signed-off-by: Yin Huai <yhuai@databricks.com>
    • Revert "[SPARK-18368] Fix regexp_replace with task serialization." · 626f6d6d
      Yin Huai authored
      This reverts commit b9192bb3.
    • [SPARK-16808][CORE] History Server main page does not honor APPLICATION_WEB_PROXY_BASE · 5bd31dc9
      Vinayak authored
      
      ## What changes were proposed in this pull request?
      
      Application links generated on the history server UI no longer contain the configured `spark.ui.proxyBase` (a regression from 1.6). To address this, `uiRoot` is now available globally to all JavaScript in the Web UI, and the mustache template (`historypage-template.html`) now includes `uiRoot` when rendering links to the applications.
      
      The existing test did not cover the scenario where an AJAX call populates the application listing template, so a new Selenium test case was added to cover it.
      
      ## How was this patch tested?
      
      Existing tests and a new unit test.
      No visual changes to the UI.
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      
      Closes #15742 from vijoshi/SPARK-16808_master.
      
      (cherry picked from commit 06a13ecc)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    • [SPARK-18292][SQL] LogicalPlanToSQLSuite should not use resource dependent path for golden file generation · ac441d17
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      `LogicalPlanToSQLSuite` uses the following command to update the existing answer files.
      
      ```bash
      SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"
      ```
      
      However, after introducing `getTestResourcePath`, it fails to update the previous golden answer files in the predefined directory. This issue aims to fix that.
      
      ## How was this patch tested?
      
      It's a testsuite update. Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15789 from dongjoon-hyun/SPARK-18292.
      
      (cherry picked from commit 02c5325b)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-17659][SQL] Partitioned View is Not Supported By SHOW CREATE TABLE · b89c38b2
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
      Partitioned views are not supported by Spark SQL. For a Hive partitioned view, SHOW CREATE TABLE is unable to generate the right DDL, so it should not support such views, as with the other Hive-only features. This PR throws an exception when the view is detected to be a partitioned view.
      
      ### How was this patch tested?
      
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15233 from gatorsmile/partitionedView.
      
      (cherry picked from commit e256392a)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    • [SPARK-18368] Fix regexp_replace with task serialization. · f6720836
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      This makes the result value both transient and lazy, so that if the RegExpReplace object is initialized then serialized, `result: StringBuffer` will be correctly initialized.
      
      ## How was this patch tested?
      
      * Verified that this patch fixed the query that found the bug.
      * Added a test case that fails without the fix.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #15816 from rdblue/SPARK-18368-fix-regexp-replace.
      
      (cherry picked from commit b9192bb3)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18333][SQL] Revert hacks in parquet and orc reader to support case insensitive resolution · 0dc14f12
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      These are no longer needed after https://issues.apache.org/jira/browse/SPARK-17183
      
      
      
      cc cloud-fan
      
      ## How was this patch tested?
      
      Existing parquet and orc tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15799 from ericl/sc-4929.
      
      (cherry picked from commit 4afa39e2)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  3. Nov 08, 2016
    • [SPARK-18239][SPARKR] Gradient Boosted Tree for R · 98dd7ac7
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Gradient Boosted Tree in R.
      With a few minor improvements to RandomForest in R.
      
      Since this is relatively isolated I'd like to target this for branch-2.1
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15746 from felixcheung/rgbt.
      
      (cherry picked from commit 55964c15)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
    • [SPARK-18342] Make rename failures fatal in HDFSBackedStateStore · 988f9080
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      If the rename operation in the state store fails (`fs.rename` returns `false`), the StateStore should throw an exception and have the task retry. Currently, a failed rename has no immediate effect during execution. However, subsequent snapshot operations fail, and then any attempt at recovery (executor failure / checkpoint recovery) also fails.
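      
      A minimal sketch of the fail-fast behaviour described above, in plain Java rather than against the Hadoop `FileSystem` API. `FatalRename` and `renameOrFail` are hypothetical names; `java.io.File.renameTo` is used only because it also reports failure by returning `false`.
      
      ```java
      import java.io.File;
      import java.io.IOException;
      import java.io.UncheckedIOException;
      
      // Hypothetical helper: turns a silent "false" from rename into an exception.
      final class FatalRename {
          static void renameOrFail(File src, File dst) {
              // renameTo does not throw on failure; it just returns false.
              if (!src.renameTo(dst)) {
                  throw new UncheckedIOException(
                      new IOException("Failed to rename " + src + " to " + dst));
              }
          }
      }
      ```
      
      Throwing at the point of failure lets the caller retry immediately, instead of discovering a missing file later during a snapshot or recovery.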
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15804 from brkyvz/rename-state.
      
      (cherry picked from commit 6f7ecb0f)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    • [SPARK-18280][CORE] Fix potential deadlock in `StandaloneSchedulerBackend.dead` · ba80eaf7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `StandaloneSchedulerBackend.dead` is called in an RPC thread, so it should not call `SparkContext.stop` in the same thread. `SparkContext.stop` blocks until all RPC threads exit, so calling it inside an RPC thread causes a deadlock.
      
      This PR adds a thread-local flag inside RPC threads. `SparkContext.stop` uses it to decide whether to launch a new thread to stop the SparkContext.
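      
      The idea can be sketched in plain Java as follows. This is an illustration under assumptions, not Spark's code: `StopGuard`, `IN_RPC_THREAD`, `stoppedOn`, and the thread name are all hypothetical.
      
      ```java
      import java.util.concurrent.atomic.AtomicReference;
      
      // Hypothetical sketch of "stop on another thread when called from an RPC thread".
      final class StopGuard {
          // RPC worker threads would set this to true when they start; other threads see false.
          static final ThreadLocal<Boolean> IN_RPC_THREAD =
              ThreadLocal.withInitial(() -> Boolean.FALSE);
      
          // Records which thread ran the shutdown work (for demonstration only).
          static final AtomicReference<String> stoppedOn = new AtomicReference<>();
      
          // Returns the helper thread doing the shutdown, or null if it ran inline.
          static Thread stop() {
              if (IN_RPC_THREAD.get()) {
                  // Stopping inline would wait for this very thread to exit, so hand off.
                  Thread t = new Thread(StopGuard::doStop, "stop-spark-context");
                  t.setDaemon(true);
                  t.start();
                  return t;
              }
              doStop();
              return null;
          }
      
          private static void doStop() {
              // Placeholder for the real shutdown work, which blocks until RPC threads exit.
              stoppedOn.set(Thread.currentThread().getName());
          }
      }
      ```
      
      A thread-local flag keeps the check cheap and avoids having to pass "am I on an RPC thread" through every call path that might reach `stop`.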
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15775 from zsxwing/SPARK-18280.
    • [SPARK-17748][ML] Minor cleanups to one-pass linear regression with elastic net · 21bbf94b
      Joseph K. Bradley authored
      
      ## What changes were proposed in this pull request?
      
      * Made SingularMatrixException `private[ml]`
      * WeightedLeastSquares: Changed to allow tol >= 0 instead of only tol > 0
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15779 from jkbradley/wls-cleanups.
      
      (cherry picked from commit 26e1c53a)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    • [SPARK-18357] Fix yarn files/archive broken issue andd unit tests · 876eee2b
      Kishor Patil authored
      
      ## What changes were proposed in this pull request?
      
      #15627 broke the YARN `--files` and `--archives` options: they no longer accepted any files.
      This patch ensures that `--files` and `--archives` accept unique files.
      
      ## How was this patch tested?
      
      A. I added unit tests.
      B. I also manually tested `--files` together with `--archives`: an exception is thrown if duplicate files are specified, and the submission continues if all files are unique.
      
      Author: Kishor Patil <kpatil@yahoo-inc.com>
      
      Closes #15810 from kishorvpatil/SPARK18357.
      
      (cherry picked from commit 245e5a2f)
      Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    • [SPARK-18346][SQL] TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec · 9595a710
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      A follow-up to https://github.com/apache/spark/pull/15688
      
      
      
      ## How was this patch tested?
      
      updated test in `DDLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15805 from cloud-fan/truncate.
      
      (cherry picked from commit 73feaa30)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-13770][DOCUMENTATION][ML] Document the ML feature Interaction · ef6b6d3d
      chie8842 authored
      
      I created Scala and Java examples and added documentation.
      
      Author: chie8842 <hayashidac@nttdata.co.jp>
      
      Closes #15658 from hayashidac/SPARK-13770.
      
      (cherry picked from commit ee2e741a)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18137][SQL] Fix RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck · 3b360e57
      root authored
      
      ## What changes were proposed in this pull request?
      
      In the `RewriteDistinctAggregates` rewrite function, after the UDAF's children are mapped to `AttributeReference`s, a UDAF (such as `ApproximatePercentile`) that has a foldable type check on its input fails that check, because an `AttributeReference` is not foldable. The UDAF is then left unresolved, and calling `nullify` on the unresolved object throws an exception.
      
      This PR maps only unfoldable children to `AttributeReference`s, which avoids tripping the UDAF's foldable type check. It also expands only unfoldable children, since there is no need to expand a static (foldable) value.
      
      **Before sql result**
      
      > select percentile_approx(key,0.99999),count(distinct key),sum(distinct key) from src limit 1
      > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS DOUBLE), CAST(0.99999BD AS DOUBLE), 10000)
      > at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
      >     at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
      
      **After sql result**
      
      > select percentile_approx(key,0.99999),count(distinct key),sum(distinct key) from src limit 1
      > [498.0,309,79136]
      ## How was this patch tested?
      
      Added a test case in `HiveUDFSuite`.
      
      Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
      
      Closes #15668 from windpiger/RewriteDistinctUDAFUnresolveExcep.
      
      (cherry picked from commit c291bd27)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    • [SPARK-18207][SQL] Fix a compilation error due to HashExpression.doGenCode · ee400f67
      Kazuaki Ishizaki authored
      
      This PR avoids a compilation error caused by Java bytecode exceeding 64 KB. The error occurs because the Java code generated to compute a hash value for a row is too big. The fix splits the large code chunk into multiple methods by calling `CodegenContext.splitExpressions` in `HashExpression.doGenCode`.
      
      The test case computes the hash code of a row that includes 1000 String fields. `HashExpression.doGenCode` generates all of the Java code for this computation in one function, so the corresponding Java bytecode exceeds 64 KB.
      
      Generated code without this PR
      ````java
      /* 027 */   public UnsafeRow apply(InternalRow i) {
      /* 028 */     boolean isNull = false;
      /* 029 */
      /* 030 */     int value1 = 42;
      /* 031 */
      /* 032 */     boolean isNull2 = i.isNullAt(0);
      /* 033 */     UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
      /* 034 */     if (!isNull2) {
      /* 035 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), value2.getBaseOffset(), value2.numBytes(), value1);
      /* 036 */     }
      /* 037 */
      /* 038 */
      /* 039 */     boolean isNull3 = i.isNullAt(1);
      /* 040 */     UTF8String value3 = isNull3 ? null : (i.getUTF8String(1));
      /* 041 */     if (!isNull3) {
      /* 042 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value3.getBaseObject(), value3.getBaseOffset(), value3.numBytes(), value1);
      /* 043 */     }
      /* 044 */
      /* 045 */
      ...
      /* 7024 */
      /* 7025 */     boolean isNull1001 = i.isNullAt(999);
      /* 7026 */     UTF8String value1001 = isNull1001 ? null : (i.getUTF8String(999));
      /* 7027 */     if (!isNull1001) {
      /* 7028 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1001.getBaseObject(), value1001.getBaseOffset(), value1001.numBytes(), value1);
      /* 7029 */     }
      /* 7030 */
      /* 7031 */
      /* 7032 */     boolean isNull1002 = i.isNullAt(1000);
      /* 7033 */     UTF8String value1002 = isNull1002 ? null : (i.getUTF8String(1000));
      /* 7034 */     if (!isNull1002) {
      /* 7035 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1002.getBaseObject(), value1002.getBaseOffset(), value1002.numBytes(), value1);
      /* 7036 */     }
      ````
      
      Generated code with this PR
      ````java
      /* 3807 */   private void apply_249(InternalRow i) {
      /* 3808 */
      /* 3809 */     boolean isNull998 = i.isNullAt(996);
      /* 3810 */     UTF8String value998 = isNull998 ? null : (i.getUTF8String(996));
      /* 3811 */     if (!isNull998) {
      /* 3812 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value998.getBaseObject(), value998.getBaseOffset(), value998.numBytes(), value1);
      /* 3813 */     }
      /* 3814 */
      /* 3815 */     boolean isNull999 = i.isNullAt(997);
      /* 3816 */     UTF8String value999 = isNull999 ? null : (i.getUTF8String(997));
      /* 3817 */     if (!isNull999) {
      /* 3818 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value999.getBaseObject(), value999.getBaseOffset(), value999.numBytes(), value1);
      /* 3819 */     }
      /* 3820 */
      /* 3821 */     boolean isNull1000 = i.isNullAt(998);
      /* 3822 */     UTF8String value1000 = isNull1000 ? null : (i.getUTF8String(998));
      /* 3823 */     if (!isNull1000) {
      /* 3824 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1000.getBaseObject(), value1000.getBaseOffset(), value1000.numBytes(), value1);
      /* 3825 */     }
      /* 3826 */
      /* 3827 */     boolean isNull1001 = i.isNullAt(999);
      /* 3828 */     UTF8String value1001 = isNull1001 ? null : (i.getUTF8String(999));
      /* 3829 */     if (!isNull1001) {
      /* 3830 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1001.getBaseObject(), value1001.getBaseOffset(), value1001.numBytes(), value1);
      /* 3831 */     }
      /* 3832 */
      /* 3833 */   }
      /* 3834 */
      ...
      /* 4532 */   private void apply_0(InternalRow i) {
      /* 4533 */
      /* 4534 */     boolean isNull2 = i.isNullAt(0);
      /* 4535 */     UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
      /* 4536 */     if (!isNull2) {
      /* 4537 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), value2.getBaseOffset(), value2.numBytes(), value1);
      /* 4538 */     }
      /* 4539 */
      /* 4540 */     boolean isNull3 = i.isNullAt(1);
      /* 4541 */     UTF8String value3 = isNull3 ? null : (i.getUTF8String(1));
      /* 4542 */     if (!isNull3) {
      /* 4543 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value3.getBaseObject(), value3.getBaseOffset(), value3.numBytes(), value1);
      /* 4544 */     }
      /* 4545 */
      /* 4546 */     boolean isNull4 = i.isNullAt(2);
      /* 4547 */     UTF8String value4 = isNull4 ? null : (i.getUTF8String(2));
      /* 4548 */     if (!isNull4) {
      /* 4549 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value4.getBaseObject(), value4.getBaseOffset(), value4.numBytes(), value1);
      /* 4550 */     }
      /* 4551 */
      /* 4552 */     boolean isNull5 = i.isNullAt(3);
      /* 4553 */     UTF8String value5 = isNull5 ? null : (i.getUTF8String(3));
      /* 4554 */     if (!isNull5) {
      /* 4555 */       value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), value5.getBaseOffset(), value5.numBytes(), value1);
      /* 4556 */     }
      /* 4557 */
      /* 4558 */   }
      ...
      /* 7344 */   public UnsafeRow apply(InternalRow i) {
      /* 7345 */     boolean isNull = false;
      /* 7346 */
      /* 7347 */     value1 = 42;
      /* 7348 */     apply_0(i);
      /* 7349 */     apply_1(i);
      ...
      /* 7596 */     apply_248(i);
      /* 7597 */     apply_249(i);
      /* 7598 */     apply_250(i);
      /* 7599 */     apply_251(i);
      ...
      ````
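      
      The splitting shown above can be sketched in a few lines of plain Java. This is only an illustration: `MethodSplitter` is a hypothetical helper, not Spark's `CodegenContext`, and the group size is a parameter here. The helpers can only share state such as `value1` if it is a field rather than a local variable, which is consistent with the listings above, where `int value1 = 42;` becomes `value1 = 42;`.
      
      ```java
      import java.util.List;
      
      // Hypothetical helper that groups generated statements into small methods.
      final class MethodSplitter {
          // Emits helper methods holding at most perMethod statements each,
          // followed by the calls that invoke the helpers in order.
          static String split(List<String> statements, int perMethod) {
              if (perMethod <= 0) {
                  throw new IllegalArgumentException("perMethod must be positive");
              }
              StringBuilder helpers = new StringBuilder();
              StringBuilder calls = new StringBuilder();
              int n = 0;
              for (int start = 0; start < statements.size(); start += perMethod, n++) {
                  int end = Math.min(start + perMethod, statements.size());
                  helpers.append("private void apply_").append(n).append("(InternalRow i) {\n");
                  for (String statement : statements.subList(start, end)) {
                      helpers.append("  ").append(statement).append('\n');
                  }
                  helpers.append("}\n");
                  calls.append("apply_").append(n).append("(i);\n");
              }
              return helpers.toString() + calls;
          }
      }
      ```
      
      Each helper stays well under the per-method bytecode limit, and the original method shrinks to a sequence of calls.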
      
      Add a new test in `DataFrameSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15745 from kiszk/SPARK-18207.
      
      (cherry picked from commit 47731e18)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
  4. Nov 07, 2016
    • [SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles · c8879bf1
      fidato authored
      
      ## What changes were proposed in this pull request?
      
      This pull request fixes the critical bug SPARK-16575. It corrects the partition calculation in `BinaryFileRDD`: an RDD created with `sc.binaryFiles` always consisted of only two partitions.
      
      ## How was this patch tested?
      
      The original issue, i.e. `getNumPartitions` on a binary-files RDD always returning two partitions, was first reproduced and then retested with the changes. The unit tests were also run and passed.
      
      This contribution is my original work and I licence the work to the project under the project's open source license
      
      srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look .
      
      Author: fidato <fidato.july13@gmail.com>
      
      Closes #15327 from fidato13/SPARK-16575.
      
      (cherry picked from commit 6f369713)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18217][SQL] Disallow creating permanent views based on temporary views or UDFs · 4cb4e5ff
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Based on the discussion in [SPARK-18209](https://issues.apache.org/jira/browse/SPARK-18209), it doesn't really make sense to create permanent views based on temporary views or temporary UDFs.
      
      To disallow this and raise an exception, this PR needs to detect whether a temporary view/UDF is being used when defining a permanent view. The work splits into two sub-tasks:
      
      **Task 1:** detecting a temporary view from the query plan of view definition.
      When finding an unresolved temporary view, Analyzer replaces it by a `SubqueryAlias` with the corresponding logical plan, which is stored in an in-memory HashMap. After replacement, it is impossible to detect whether the `SubqueryAlias` is added/generated from a temporary view. Thus, to detect the usage of a temporary view in view definition, this PR traverses the unresolved logical plan and uses the name of an `UnresolvedRelation` to detect whether it is a (global) temporary view.
      
      **Task 2:** detecting a temporary UDF from the query plan of view definition.
      Detecting usage of a temporary UDF in a view definition is not straightforward.
      
      First, the analyzed plan represents functions in several different forms. More importantly, some classes (e.g., `HiveGenericUDF`) are not accessible from `CreateViewCommand`, which is part of `sql/core`. Thus, we use the unanalyzed plan `child` of `CreateViewCommand` to detect the usage of a temporary UDF. Because the plan has already been successfully analyzed, we can assume the functions have been defined/registered.
      
      Second, functions in Spark take four forms: Spark built-in functions, built-in hash functions, permanent UDFs and temporary UDFs. There is no direct way to determine whether a function is temporary. Thus, we introduce a function `isTemporaryFunction` in `SessionCatalog`, which contains the detailed logic for making that determination.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15764 from gatorsmile/blockTempFromPermViewCreation.
      
      (cherry picked from commit 1da64e1f)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18261][STRUCTURED STREAMING] Add statistics to MemorySink for joining · 4943929d
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      Right now, there is no way to join the output of a memory sink with any table:
      
      > UnsupportedOperationException: LeafNode MemoryPlan must implement statistics
      
      This patch adds statistics to MemorySink, making joining snapshots of memory streams with tables possible.
      
      ## How was this patch tested?
      
      Added a test case.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15786 from lw-lin/memory-sink-stat.
      
      (cherry picked from commit c1a0c66b)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18086] Add support for Hive session vars. · 29f59c73
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      This adds support for Hive variables:
      
      * Makes values set via `spark-sql --hivevar name=value` accessible
      * Adds `getHiveVar` and `setHiveVar` to the `HiveClient` interface
      * Adds a SessionVariables trait for sessions like Hive that support variables (including Hive vars)
      * Adds SessionVariables support to variable substitution
      * Adds SessionVariables support to the SET command
      
      ## How was this patch tested?
      
      * Adds a test to all supported Hive versions for accessing Hive variables
      * Adds HiveVariableSubstitutionSuite
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #15738 from rdblue/SPARK-18086-add-hivevar-support.
      
      (cherry picked from commit 9b0593d5)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18295][SQL] Make to_json function null safe (matching it to from_json) · 4af82d56
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to match the behaviour of the `to_json` function to that of `from_json` with respect to null-safety.
      
      Currently, it throws `NullPointerException`, but this PR fixes this to produce `null` instead.
      
      With the data below:
      
      ```scala
      import spark.implicits._
      
      val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a")
      df.show()
      ```
      
      ```
      +----+
      |   a|
      +----+
      | [1]|
      |null|
      +----+
      ```
      
      the code below
      
      ```scala
      import org.apache.spark.sql.functions._
      
      df.select(to_json($"a")).show()
      ```
      
      produces:
      
      **Before**
      
      throws `NullPointerException` as below:
      
      ```
      java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193)
        at org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544)
        at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      ```
      
      **After**
      
      ```
      +---------------+
      |structtojson(a)|
      +---------------+
      |       {"_1":1}|
      |           null|
      +---------------+
      ```
      
      ## How was this patch tested?
      
      Unit test in `JsonExpressionsSuite.scala` and `JsonFunctionsSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15792 from HyukjinKwon/SPARK-18295.
      
      (cherry picked from commit 3eda0570)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      4af82d56
    • Kazuaki Ishizaki's avatar
      [SPARK-17490][SQL] Optimize SerializeFromObject() for a primitive array · 9873d57f
      Kazuaki Ishizaki authored
      Waiting for merging #13680
      
      This PR optimizes `SerializeFromObject()` for a primitive array. It is derived from #13758 and addresses one of the problems raised there in a simpler way.
      
      The current implementation always generates `GenericArrayData` from `SerializeFromObject()` for any type of array in a logical plan. This involves boxing in the constructor of `GenericArrayData` when `SerializeFromObject()` has a primitive array.
      
      This PR enables generating `UnsafeArrayData` from `SerializeFromObject()` for a primitive array. This avoids boxing when creating an instance of `ArrayData` in the code generated by Catalyst.
      
      This PR also generates `UnsafeArrayData` in the cases of `RowEncoder.serializeFor` or `CatalystTypeConverters.createToCatalystConverter`.
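
      To make the boxing cost concrete, here is a minimal sketch in plain Scala. It does not use Spark's `GenericArrayData` or `UnsafeArrayData`; `BoxingSketch`, `toGeneric`, and `toFlatBuffer` are hypothetical names, and the flat buffer ignores the header and null bits that a real unsafe layout would carry. It only contrasts storing an `int[]` as generic objects with copying the primitives into a contiguous buffer:

      ```scala
      import java.nio.ByteBuffer

      object BoxingSketch {
        // Generic path: each Int is boxed into a java.lang.Integer inside Array[Any].
        def toGeneric(values: Array[Int]): Array[Any] = values.map(v => v: Any)

        // Primitive path: copy the ints into a flat byte buffer, with no per-element objects.
        def toFlatBuffer(values: Array[Int]): ByteBuffer = {
          val buf = ByteBuffer.allocate(values.length * 4)
          values.foreach(v => buf.putInt(v))
          buf.flip() // switch the buffer from writing to reading
          buf
        }

        def main(args: Array[String]): Unit = {
          val generic = toGeneric(Array(1, 2))
          println(generic.forall(_.isInstanceOf[java.lang.Integer])) // prints true
          println(toFlatBuffer(Array(1, 2)).remaining)               // prints 8
        }
      }
      ```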
      
      Performance improvement of `SerializeFromObject()` is up to 2.0x
      
      ```
      OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
      Intel Xeon E3-12xx v2 (Ivy Bridge)
      
      Without this PR
      Write an array in Dataset:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            556 /  608         15.1          66.3       1.0X
      Double                                        1668 / 1746          5.0         198.8       0.3X
      
      with this PR
      Write an array in Dataset:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            352 /  401         23.8          42.0       1.0X
      Double                                         821 /  885         10.2          97.9       0.4X
      ```
      
      Here is an example program of the kind that occurs in MLlib, as described in [SPARK-16070](https://issues.apache.org/jira/browse/SPARK-16070).
      
      ```
      sparkContext.parallelize(Seq(Array(1, 2)), 1).toDS.map(e => e).show
      ```
      
      Generated code before applying this PR
      
      ``` java
      /* 039 */   protected void processNext() throws java.io.IOException {
      /* 040 */     while (inputadapter_input.hasNext()) {
      /* 041 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 042 */       int[] inputadapter_value = (int[])inputadapter_row.get(0, null);
      /* 043 */
      /* 044 */       Object mapelements_obj = ((Expression) references[0]).eval(null);
      /* 045 */       scala.Function1 mapelements_value1 = (scala.Function1) mapelements_obj;
      /* 046 */
      /* 047 */       boolean mapelements_isNull = false || false;
      /* 048 */       int[] mapelements_value = null;
      /* 049 */       if (!mapelements_isNull) {
      /* 050 */         Object mapelements_funcResult = null;
      /* 051 */         mapelements_funcResult = mapelements_value1.apply(inputadapter_value);
      /* 052 */         if (mapelements_funcResult == null) {
      /* 053 */           mapelements_isNull = true;
      /* 054 */         } else {
      /* 055 */           mapelements_value = (int[]) mapelements_funcResult;
      /* 056 */         }
      /* 057 */
      /* 058 */       }
      /* 059 */       mapelements_isNull = mapelements_value == null;
      /* 060 */
      /* 061 */       serializefromobject_argIsNulls[0] = mapelements_isNull;
      /* 062 */       serializefromobject_argValue = mapelements_value;
      /* 063 */
      /* 064 */       boolean serializefromobject_isNull = false;
      /* 065 */       for (int idx = 0; idx < 1; idx++) {
      /* 066 */         if (serializefromobject_argIsNulls[idx]) { serializefromobject_isNull = true; break; }
      /* 067 */       }
      /* 068 */
      /* 069 */       final ArrayData serializefromobject_value = serializefromobject_isNull ? null : new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_argValue);
      /* 070 */       serializefromobject_holder.reset();
      /* 071 */
      /* 072 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 073 */
      /* 074 */       if (serializefromobject_isNull) {
      /* 075 */         serializefromobject_rowWriter.setNullAt(0);
      /* 076 */       } else {
      /* 077 */         // Remember the current cursor so that we can calculate how many bytes are
      /* 078 */         // written later.
      /* 079 */         final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
      /* 080 */
      /* 081 */         if (serializefromobject_value instanceof UnsafeArrayData) {
      /* 082 */           final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
      /* 083 */           // grow the global buffer before writing data.
      /* 084 */           serializefromobject_holder.grow(serializefromobject_sizeInBytes);
      /* 085 */           ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
      /* 086 */           serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
      /* 087 */
      /* 088 */         } else {
      /* 089 */           final int serializefromobject_numElements = serializefromobject_value.numElements();
      /* 090 */           serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4);
      /* 091 */
      /* 092 */           for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
      /* 093 */             if (serializefromobject_value.isNullAt(serializefromobject_index)) {
      /* 094 */               serializefromobject_arrayWriter.setNullInt(serializefromobject_index);
      /* 095 */             } else {
      /* 096 */               final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index);
      /* 097 */               serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
      /* 098 */             }
      /* 099 */           }
      /* 100 */         }
      /* 101 */
      /* 102 */         serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
      /* 103 */       }
      /* 104 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
      /* 105 */       append(serializefromobject_result);
      /* 106 */       if (shouldStop()) return;
      /* 107 */     }
      /* 108 */   }
      /* 109 */ }
      ```
      
      Generated code after applying this PR
      
      ``` java
      /* 035 */   protected void processNext() throws java.io.IOException {
      /* 036 */     while (inputadapter_input.hasNext()) {
      /* 037 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 038 */       int[] inputadapter_value = (int[])inputadapter_row.get(0, null);
      /* 039 */
      /* 040 */       Object mapelements_obj = ((Expression) references[0]).eval(null);
      /* 041 */       scala.Function1 mapelements_value1 = (scala.Function1) mapelements_obj;
      /* 042 */
      /* 043 */       boolean mapelements_isNull = false || false;
      /* 044 */       int[] mapelements_value = null;
      /* 045 */       if (!mapelements_isNull) {
      /* 046 */         Object mapelements_funcResult = null;
      /* 047 */         mapelements_funcResult = mapelements_value1.apply(inputadapter_value);
      /* 048 */         if (mapelements_funcResult == null) {
      /* 049 */           mapelements_isNull = true;
      /* 050 */         } else {
      /* 051 */           mapelements_value = (int[]) mapelements_funcResult;
      /* 052 */         }
      /* 053 */
      /* 054 */       }
      /* 055 */       mapelements_isNull = mapelements_value == null;
      /* 056 */
      /* 057 */       boolean serializefromobject_isNull = mapelements_isNull;
      /* 058 */       final ArrayData serializefromobject_value = serializefromobject_isNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(mapelements_value);
      /* 059 */       serializefromobject_isNull = serializefromobject_value == null;
      /* 060 */       serializefromobject_holder.reset();
      /* 061 */
      /* 062 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 063 */
      /* 064 */       if (serializefromobject_isNull) {
      /* 065 */         serializefromobject_rowWriter.setNullAt(0);
      /* 066 */       } else {
      /* 067 */         // Remember the current cursor so that we can calculate how many bytes are
      /* 068 */         // written later.
      /* 069 */         final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
      /* 070 */
      /* 071 */         if (serializefromobject_value instanceof UnsafeArrayData) {
      /* 072 */           final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
      /* 073 */           // grow the global buffer before writing data.
      /* 074 */           serializefromobject_holder.grow(serializefromobject_sizeInBytes);
      /* 075 */           ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
      /* 076 */           serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
      /* 077 */
      /* 078 */         } else {
      /* 079 */           final int serializefromobject_numElements = serializefromobject_value.numElements();
      /* 080 */           serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4);
      /* 081 */
      /* 082 */           for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
      /* 083 */             if (serializefromobject_value.isNullAt(serializefromobject_index)) {
      /* 084 */               serializefromobject_arrayWriter.setNullInt(serializefromobject_index);
      /* 085 */             } else {
      /* 086 */               final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index);
      /* 087 */               serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
      /* 088 */             }
      /* 089 */           }
      /* 090 */         }
      /* 091 */
      /* 092 */         serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
      /* 093 */       }
      /* 094 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
      /* 095 */       append(serializefromobject_result);
      /* 096 */       if (shouldStop()) return;
      /* 097 */     }
      /* 098 */   }
      /* 099 */ }
      ```
      
      Added tests in `DatasetSuite`, `RowEncoderSuite`, and `CatalystTypeConvertersSuite`.
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15044 from kiszk/SPARK-17490.
      
      (cherry picked from commit 19cf2080)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      9873d57f
    • Weiqing Yang's avatar
      [SPARK-17108][SQL] Fix BIGINT and INT comparison failure in spark sql · d1eac3ef
      Weiqing Yang authored
      
      ## What changes were proposed in this pull request?
      
      Add a function to check whether two integral types are compatible when invoking `acceptsType()` in `DataType`.
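
      One possible reading of such a compatibility check is sketched below in plain Scala. The type names (`SimpleType`, `IntegralLike`, `IntT`, `LongT`, and so on) and the widening rule are assumptions made for illustration; they are not Spark's `DataType` classes, and the actual rule in this patch may differ:

      ```scala
      // Simplified stand-ins for Spark's data types; all names here are hypothetical.
      sealed trait SimpleType
      sealed abstract class IntegralLike(val byteWidth: Int) extends SimpleType
      case object ByteT extends IntegralLike(1)
      case object ShortT extends IntegralLike(2)
      case object IntT extends IntegralLike(4)
      case object LongT extends IntegralLike(8)
      case object StringT extends SimpleType

      object TypeCompatSketch {
        // Assumed rule: an expected integral type accepts any integral type that is
        // no wider than itself; every other pair must match exactly.
        def acceptsType(expected: SimpleType, actual: SimpleType): Boolean =
          (expected, actual) match {
            case (e: IntegralLike, a: IntegralLike) => a.byteWidth <= e.byteWidth
            case _ => expected == actual
          }

        def main(args: Array[String]): Unit = {
          println(acceptsType(LongT, IntT))    // prints true
          println(acceptsType(IntT, LongT))    // prints false
          println(acceptsType(LongT, StringT)) // prints false
        }
      }
      ```

      Under this assumed rule, an `int` literal used as a key into a `map<bigint, ...>` column would be accepted rather than rejected.
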
      ## How was this patch tested?
      
      Manually.
      E.g.
      
      ```
          spark.sql("create table t3(a map<bigint, array<string>>)")
          spark.sql("select * from t3 where a[1] is not null")
      ```
      
      Before:
      
      ```
      cannot resolve 't.`a`[1]' due to data type mismatch: argument 2 requires bigint type, however, '1' is of int type.; line 1 pos 22
      org.apache.spark.sql.AnalysisException: cannot resolve 't.`a`[1]' due to data type mismatch: argument 2 requires bigint type, however, '1' is of int type.; line 1 pos 22
          at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
          at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
          at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
          at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:307)
      ```
      
      After:
      Running the SQL queries above produces no errors.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15448 from weiqingy/SPARK_17108.
      
      (cherry picked from commit 0d95662e)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      d1eac3ef
    • Tathagata Das's avatar
      [SPARK-18283][STRUCTURED STREAMING][KAFKA] Added test to check whether default... · 7a84edb2
      Tathagata Das authored
      [SPARK-18283][STRUCTURED STREAMING][KAFKA] Added test to check whether default starting offset is latest
      
      ## What changes were proposed in this pull request?
      
      Added a test to check whether the default starting offset is `latest`.
      
      ## How was this patch tested?
      new unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15778 from tdas/SPARK-18283.
      
      (cherry picked from commit b06c23db)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      7a84edb2
    • Yanbo Liang's avatar
      [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial. · 6b332909
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      SparkR `spark.glm` predict should output the original label when family = "binomial".
      
      ## How was this patch tested?
      Add unit test.
      You can also run the following code to test:
      ```R
      training <- suppressWarnings(createDataFrame(iris))
      training <- training[training$Species %in% c("versicolor", "virginica"), ]
      model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = binomial(link = "logit"))
      showDF(predict(model, training))
      ```
      Before this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      ```
      After this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+----------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
      +------------+-----------+------------+-----------+----------+-----+----------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|versicolor|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|versicolor|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0|versicolor|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| virginica|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|versicolor|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|versicolor|
      +------------+-----------+------------+-----------+----------+-----+----------+
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15788 from yanboliang/spark-18291.
      
      (cherry picked from commit daa975f4)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      6b332909
    • Liang-Chi Hsieh's avatar
      [SPARK-18125][SQL] Fix a compilation error in codegen due to splitExpression · df40ee2b
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      As reported in the JIRA, the Java code generated by codegen sometimes fails to compile.
      
      Code snippet to test it:
      
          case class Route(src: String, dest: String, cost: Int)
          case class GroupedRoutes(src: String, dest: String, routes: Seq[Route])
      
          val ds = sc.parallelize(Array(
            Route("a", "b", 1),
            Route("a", "b", 2),
            Route("a", "c", 2),
            Route("a", "d", 10),
            Route("b", "a", 1),
            Route("b", "a", 5),
            Route("b", "c", 6))
          ).toDF.as[Route]
      
          val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
            .groupByKey(r => (r.src, r.dest))
            .reduceGroups { (g1: GroupedRoutes, g2: GroupedRoutes) =>
              GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)
            }.map(_._2)
      
      The problem is that in `ReferenceToExpressions` we evaluate the children vars to local variables, and the result expression is then evaluated using those variables. In the above case, the result expression code is too long and is split by `CodegenContext.splitExpression`, so those local variables can no longer be accessed, which causes a compilation error.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15693 from viirya/fix-codege-compilation-error.
      
      (cherry picked from commit a814eeac)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      df40ee2b
    • gatorsmile's avatar
      [SPARK-16904][SQL] Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry · 41010295
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
      The Hive built-in `hash` function has not been used in Spark since Spark 2.0. The public interface does not allow users to unregister the Spark built-in functions. Thus, users will never use Hive's built-in `hash` function.
      
      The only exception here is `TestHiveFunctionRegistry`, which allows users to unregister the built-in functions. Thus, we can load Hive's hash function in the test cases. If we disable it, 10+ test cases will fail because the results are different from the Hive golden answer files.
      
      This PR removes `hash` from the list of `hiveFunctions` in `HiveSessionCatalog`. It also removes `TestHiveFunctionRegistry`. This removal makes it easier to remove `TestHiveSessionState` in the future.
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14498 from gatorsmile/removeHash.
      
      (cherry picked from commit 57626a55)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      41010295
    • Reynold Xin's avatar
      [SPARK-18296][SQL] Use consistent naming for expression test suites · 2fa1a632
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      We have an undocumented naming convention to call expression unit tests ExpressionsSuite, and the end-to-end tests FunctionsSuite. It'd be great to make all test suites consistent with this naming convention.
      
      ## How was this patch tested?
      This is a test-only naming change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15793 from rxin/SPARK-18296.
      
      (cherry picked from commit 9db06c44)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      2fa1a632
    • Reynold Xin's avatar
      [SPARK-18167][SQL] Disable flaky hive partition pruning test. · 9ebd5e56
      Reynold Xin authored
      
      (cherry picked from commit 07ac3f09)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      9ebd5e56
  5. Nov 06, 2016
    • Wenchen Fan's avatar
      [SPARK-18173][SQL] data source tables should support truncating partition · 9c78d355
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      Previously, `TRUNCATE TABLE ... PARTITION` always truncated the whole table for data source tables. This PR fixes that and improves `InMemoryCatalog` so that the command works with it.
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15688 from cloud-fan/truncate.
      
      (cherry picked from commit 46b2e499)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      9c78d355
    • hyukjinkwon's avatar
      [SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens · a8fbcdbf
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      Currently, there are three cases when reading CSV with the datasource in `PERMISSIVE` parse mode.
      
      - schema == parsed tokens (from each line)
        There is no problem casting the values in the tokens to the fields in the schema, as the counts are equal.
      
      - schema < parsed tokens (from each line)
        It slices the tokens down to the number of fields in the schema.
      
      - schema > parsed tokens (from each line)
        It appends `null` to the parsed tokens so that the values can be cast safely against the schema.
      
      However, when `null` is appended in the third case, we should take `null` into account when casting the values.
      
      In the case of `StringType`, this is fine, as `UTF8String.fromString(datum)` produces `null` when the input is `null`. Therefore, the problem occurs only when the schema is given explicitly and includes data types other than `StringType`.
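
      The three cases and the null-aware cast can be sketched in plain Scala as follows. This is an illustration only, not the CSV datasource code: `PermissiveCsvSketch`, `alignTokens`, and `castToInt` are hypothetical names, and `Option` is used here in place of Spark's internal null handling:

      ```scala
      object PermissiveCsvSketch {
        // Truncate extra tokens, or pad with nulls, so the token count matches the schema width.
        def alignTokens(tokens: Array[String], numFields: Int): Array[String] =
          if (tokens.length >= numFields) tokens.take(numFields)
          else tokens ++ Array.fill[String](numFields - tokens.length)(null)

        // Null-aware integer cast: a padded (null) token yields None instead of an exception.
        def castToInt(datum: String): Option[Int] =
          Option(datum).map(_.trim.toInt)

        def main(args: Array[String]): Unit = {
          // One token for a two-field schema, as in the example from this PR.
          val row = alignTokens(Array("1"), 2).map(castToInt)
          println(row.mkString(", ")) // prints Some(1), None
        }
      }
      ```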
      
      The code below:
      
      ```scala
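      import org.apache.spark.sql.types._
      import spark.implicits._
      
      val path = "/tmp/a"
      // `path` is a plain String, so it is passed directly (String has no getAbsolutePath).
      Seq("1").toDF().write.text(path)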
      val schema = StructType(
        StructField("a", IntegerType, true) ::
        StructField("b", IntegerType, true) :: Nil)
      spark.read.schema(schema).option("header", "false").csv(path).show()
      ```
      
      prints
      
      **Before**
      
      ```
      java.lang.NumberFormatException: null
      at java.lang.Integer.parseInt(Integer.java:542)
      at java.lang.Integer.parseInt(Integer.java:615)
      at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
      at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
      at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:24)
      ```
      
      **After**
      
      ```
      +---+----+
      |  a|   b|
      +---+----+
      |  1|null|
      +---+----+
      ```
      
      ## How was this patch tested?
      
      Unit test in `CSVSuite.scala` and `CSVTypeCastSuite.scala`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15767 from HyukjinKwon/SPARK-18269.
      
      (cherry picked from commit 556a3b7d)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      a8fbcdbf
    • Wojciech Szymanski's avatar
      [SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID · d2f2cf68
      Wojciech Szymanski authored
      
      ## What changes were proposed in this pull request?
      
      Motivation:
      `org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an instance with the same UID. It does not conform to the method specification from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)`
      
      Solution:
      - fix for Pipeline UID
      - introduced new tests for `org.apache.spark.ml.Pipeline.copy`
      - minor improvements in test for `org.apache.spark.ml.PipelineModel.copy`
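
      The contract in question can be illustrated with a small stand-alone sketch. `StageSketch` and its methods are hypothetical and do not reproduce Spark's `Params` or `Pipeline` classes; the point is only that a conforming `copy` reuses the existing UID, whereas constructing the copy through a UID-generating constructor does not:

      ```scala
      import java.util.UUID

      // Hypothetical miniature of a pipeline stage; not Spark's actual Params API.
      class StageSketch(val uid: String, val params: Map[String, Any]) {
        // Auxiliary constructor that mints a fresh random UID.
        def this(params: Map[String, Any]) =
          this("stage_" + UUID.randomUUID().toString.take(12), params)

        // Non-conforming copy: goes through the auxiliary constructor, so the UID changes.
        def copyWithFreshUid(extra: Map[String, Any]): StageSketch =
          new StageSketch(params ++ extra)

        // Conforming copy: keeps the UID and merges in the extra params.
        def copy(extra: Map[String, Any]): StageSketch =
          new StageSketch(uid, params ++ extra)
      }

      object StageSketchDemo {
        def main(args: Array[String]): Unit = {
          val original = new StageSketch(Map("maxIter" -> 10))
          println(original.copy(Map("tol" -> 1e-6)).uid == original.uid) // prints true
        }
      }
      ```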
      
      ## How was this patch tested?
      
      Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"`
      Improved existing unit test: `org.apache.spark.ml.PipelineSuite."PipelineModel.copy"`
      
      Author: Wojciech Szymanski <wk.szymanski@gmail.com>
      
      Closes #15759 from wojtek-szymanski/SPARK-18210.
      
      (cherry picked from commit b89d0556)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      d2f2cf68
    • hyukjinkwon's avatar
      [SPARK-17854][SQL] rand/randn allows null/long as input seed · dcbf3fd4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes that `rand`/`randn` accept `null` as input in Scala/SQL and `LongType` as input in SQL. In the `null` case, the value is treated as `0`.
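
      A minimal sketch of this seed handling, in plain Scala, is shown below. `SeedSketch` and `normalizeSeed` are hypothetical names, and `scala.util.Random` stands in for Spark's own generator purely for illustration, so the numbers it produces will not match the output of `rand` in Spark:

      ```scala
      import scala.util.Random

      object SeedSketch {
        // Map the accepted seed inputs onto a Long; null is treated as 0.
        def normalizeSeed(seed: Any): Long = seed match {
          case null    => 0L
          case i: Int  => i.toLong
          case l: Long => l
          case other   => throw new IllegalArgumentException(s"Unsupported seed: $other")
        }

        def main(args: Array[String]): Unit = {
          val fromNull = new Random(normalizeSeed(null)).nextDouble()
          val fromZero = new Random(normalizeSeed(0)).nextDouble()
          println(fromNull == fromZero)                  // prints true
          println(normalizeSeed(1) == normalizeSeed(1L)) // prints true
        }
      }
      ```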
      
      So, this PR includes both changes below:
      - `null` support
      
        It seems MySQL also accepts this.
      
        ``` sql
        mysql> select rand(0);
        +---------------------+
        | rand(0)             |
        +---------------------+
        | 0.15522042769493574 |
        +---------------------+
        1 row in set (0.00 sec)
      
        mysql> select rand(NULL);
        +---------------------+
        | rand(NULL)          |
        +---------------------+
        | 0.15522042769493574 |
        +---------------------+
        1 row in set (0.00 sec)
        ```
      
        and Hive does as well, according to [HIVE-14694](https://issues.apache.org/jira/browse/HIVE-14694).
      
        So the code below:
      
        ``` scala
        spark.range(1).selectExpr("rand(null)").show()
        ```
      
        prints..
      
        **Before**
      
        ```
          Input argument to rand must be an integer literal.;; line 1 pos 0
        org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
        at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
        at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444)
        ```
      
        **After**
      
        ```
          +-----------------------+
          |rand(CAST(NULL AS INT))|
          +-----------------------+
          |    0.13385709732307427|
          +-----------------------+
        ```
      - `LongType` support in SQL.
      
        In addition, it makes the functions accept `LongType` consistently across Scala and SQL.
      
        In more detail, the code below:
      
        ``` scala
        spark.range(1).select(rand(1), rand(1L)).show()
        spark.range(1).selectExpr("rand(1)", "rand(1L)").show()
        ```
      
        prints..
      
        **Before**
      
        ```
        +------------------+------------------+
        |           rand(1)|           rand(1)|
        +------------------+------------------+
        |0.2630967864682161|0.2630967864682161|
        +------------------+------------------+
      
        Input argument to rand must be an integer literal.;; line 1 pos 0
        org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
        at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
        at
        ```
      
        **After**
      
        ```
        +------------------+------------------+
        |           rand(1)|           rand(1)|
        +------------------+------------------+
        |0.2630967864682161|0.2630967864682161|
        +------------------+------------------+
      
        +------------------+------------------+
        |           rand(1)|           rand(1)|
        +------------------+------------------+
        |0.2630967864682161|0.2630967864682161|
        +------------------+------------------+
        ```
      ## How was this patch tested?
      
      Unit tests in `DataFrameSuite.scala` and `RandomSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15432 from HyukjinKwon/SPARK-17854.
      
      (cherry picked from commit 340f09d1)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      dcbf3fd4
    • sethah's avatar
      [SPARK-18276][ML] ML models should copy the training summary and set parent · c42301f1
      sethah authored
      
      ## What changes were proposed in this pull request?
      
      Only some of the models that contain a training summary currently set the summary in the copy method. Linear and logistic regression do; GLR, GMM, KM, and BKM do not. Additionally, these copy methods did not set the parent pointer of the copied model. This patch modifies the copy methods of the four models mentioned above to copy the training summary and set the parent.
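
      The intended copy behavior can be sketched with a stand-alone example. The classes below (`TrainingSummary`, `EstimatorSketch`, `FittedModelSketch`) are hypothetical and are not Spark's `Model` or summary classes; the sketch only shows a copy that carries over both the summary and the parent pointer:

      ```scala
      // Hypothetical miniature of a fitted model; not Spark's Model API.
      final class TrainingSummary(val iterations: Int)
      final class EstimatorSketch(val name: String)

      class FittedModelSketch(val coefficients: Vector[Double]) {
        var parent: Option[EstimatorSketch] = None
        var summary: Option[TrainingSummary] = None

        // Copy that carries over both the training summary and the parent pointer.
        def copy(): FittedModelSketch = {
          val copied = new FittedModelSketch(coefficients)
          copied.summary = summary
          copied.parent = parent
          copied
        }
      }

      object FittedModelSketchDemo {
        def main(args: Array[String]): Unit = {
          val model = new FittedModelSketch(Vector(0.5, 1.5))
          model.parent = Some(new EstimatorSketch("kmeans"))
          model.summary = Some(new TrainingSummary(iterations = 20))
          val copied = model.copy()
          println(copied.summary.isDefined)      // prints true
          println(copied.parent == model.parent) // prints true
        }
      }
      ```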
      
      ## How was this patch tested?
      
      Add unit tests in Linear/Logistic/GeneralizedLinear regression and GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the copied model and check that the copied model has a summary.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15773 from sethah/SPARK-18276.
      
      (cherry picked from commit 23ce0d1e)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      c42301f1
  6. Nov 05, 2016