  1. Nov 24, 2017
      Preparing Spark release v2.2.1-rc2 · e30e2698
      Felix Cheung authored
      fix typo · c3b5df22
      Felix Cheung authored
      [SPARK-22495] Fix setup of SPARK_HOME variable on Windows · b606cc2b
      Jakub Nowacki authored
      ## What changes were proposed in this pull request?
      
This is a cherry-pick of the original PR 19370 onto branch-2.2, as suggested in https://github.com/apache/spark/pull/19370#issuecomment-346526920.
      
Fixes how `SPARK_HOME` is resolved on Windows. While the previous version worked with the built release download, the directory layout changed slightly for PySpark `pip` or `conda` installs. This was reflected in the Linux scripts in `bin` but not in the Windows `cmd` files.
      
The first fix improves how the `jars` directory is found, as this was stopping the Windows `pip/conda` install from working; JARs were not found during Session/Context setup.
      
The second fix adds a `find-spark-home.cmd` script which, like the Linux version, uses the `find_spark_home.py` script to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to the limitations of the `cmd` script language. If the `SPARK_HOME` environment variable is already set, the Python script `find_spark_home.py` is not run. The process can fail if Python is not installed, but this path is mostly taken when PySpark was installed via `pip/conda`, in which case some Python is present on the system.
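
The resolution order is easier to see as code. Below is a minimal sketch in Scala (illustration only; the actual fix is a Windows `cmd` script, and the helper invocation is simplified): honor an already-set `SPARK_HOME`, otherwise fall back to the `find_spark_home.py` helper that ships with `pip/conda` installs.

```scala
import scala.sys.process._
import scala.util.Try

object FindSparkHome {
  def resolve(): Option[String] =
    sys.env.get("SPARK_HOME")          // respect the variable if already set
      .orElse(run("python"))           // otherwise ask the Python helper
      .orElse(run("python3"))

  // Runs find_spark_home.py and captures its output; Try absorbs the failure
  // when no Python is installed, matching the caveat above.
  private def run(python: String): Option[String] =
    Try(Seq(python, "find_spark_home.py").!!.trim).toOption.filter(_.nonEmpty)
}
```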
      
      ## How was this patch tested?
      
Tested on a local installation.
      
      Author: Jakub Nowacki <j.s.nowacki@gmail.com>
      
      Closes #19807 from jsnowacki/fix_spark_cmds_2.
      [SPARK-22595][SQL] fix flaky test: CastSuite.SPARK-22500: cast for struct... · ad57141f
      Kazuaki Ishizaki authored
      [SPARK-22595][SQL] fix flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes beyond 64KB
      
This PR reduces the number of fields in the test case of `CastSuite` to fix an issue pointed out [here](https://github.com/apache/spark/pull/19800#issuecomment-346634950).
      
      ```
      java.lang.OutOfMemoryError: GC overhead limit exceeded
      java.lang.OutOfMemoryError: GC overhead limit exceeded
      	at org.codehaus.janino.UnitCompiler.findClass(UnitCompiler.java:10971)
      	at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:7607)
      	at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5758)
      	at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5732)
      	at org.codehaus.janino.UnitCompiler.access$13200(UnitCompiler.java:206)
      	at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5668)
      	at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5660)
      	at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3356)
      	at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5660)
      	at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2892)
      	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2764)
      	at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
      	at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
      	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
      	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
      	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
      	at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
      	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
      	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
      	at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
      	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
      	at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
      	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
      	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
      	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
      	at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
      	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
      	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
      	at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
      	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
      	at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
      	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
      ...
      ```
      
Used an existing test case.
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19806 from kiszk/SPARK-22595.
      
      (cherry picked from commit 554adc77)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      [SPARK-22591][SQL] GenerateOrdering shouldn't change CodegenContext.INPUT_ROW · f4c457a3
      Liang-Chi Hsieh authored
      
      ## What changes were proposed in this pull request?
      
When I played with codegen while developing another PR, I found that the value of `CodegenContext.INPUT_ROW` is not reliable. Under whole-stage codegen, it is assigned null first and then suddenly changed to `i`.
      
      The reason is `GenerateOrdering` changes `CodegenContext.INPUT_ROW` but doesn't restore it back.
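
A minimal sketch of the save-and-restore discipline the fix calls for (the `CodegenContext` here is a stand-in class, not Spark's real one):

```scala
class CodegenContext { var INPUT_ROW: String = null }

// Temporarily point INPUT_ROW at `row` and always put the caller's value back.
def withInputRow[T](ctx: CodegenContext, row: String)(body: => T): T = {
  val saved = ctx.INPUT_ROW
  ctx.INPUT_ROW = row
  try body
  finally ctx.INPUT_ROW = saved // restore even if codegen throws
}
```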
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19800 from viirya/SPARK-22591.
      
      (cherry picked from commit 62a826f1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 to branch-2.2 · f8e73d02
      vinodkc authored
      ## What changes were proposed in this pull request?
      
A follow-up of https://github.com/apache/spark/pull/19795 to simplify the file creation.
      
      ## How was this patch tested?
      
Only a test case is updated.
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #19809 from vinodkc/br_FollowupSPARK-17920_branch-2.2.
  2. Nov 22, 2017
      [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Backport PR 19779 to branch-2.2 -... · b17f4063
      vinodkc authored
      [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Backport PR 19779 to branch-2.2 - Support writing to Hive table which uses Avro schema url 'avro.schema.url'
      
      ## What changes were proposed in this pull request?
      
      > Backport https://github.com/apache/spark/pull/19779 to branch-2.2
      
      SPARK-19580 Support for avro.schema.url while writing to hive table
      SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
      SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url
      
Support writing to a Hive table which uses the Avro schema URL 'avro.schema.url'. For example:
```
create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

insert overwrite table avro_out select * from avro_in; -- fails with java.lang.NullPointerException
```

```
WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem
java.lang.NullPointerException
	at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
```
      ## Changes proposed in this fix
Currently a null value is passed to the serializer, which causes an NPE during the insert operation; instead, pass the Hadoop configuration object.
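
A hedged sketch of the shape of the change (the actual edit lives in `InsertIntoHiveTable`/`HiveWriterContainer`; `initSerde` is an invented name):

```scala
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.Serializer

// Before: serializer.initialize(null, tableProperties) -> NPE in AvroSerDe.
// Passing the live Hadoop configuration lets FileSystem.get(...) resolve
// the avro.schema.url table property.
def initSerde(serializer: Serializer, hadoopConf: Configuration,
    tableProperties: Properties): Unit = {
  serializer.initialize(hadoopConf, tableProperties)
}
```
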
      ## How was this patch tested?
      Added new test case in VersionsSuite
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #19795 from vinodkc/br_Fix_SPARK-17920_branch-2.2.
  3. Nov 21, 2017
      [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source · df9228b4
      Jia Li authored
      ## What changes were proposed in this pull request?
      
Let’s say I have a nested AND expression shown below and p2 cannot be pushed down:
      
      (p1 AND p2) OR p3
      
In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](https://github.com/apache/spark/pull/10362) for Parquet. When we have an AND nested below another expression, we should either push both legs or nothing.
      
      Note that:
      - The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not
      - If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression.
      - The current Spark code logic for OR is OK. It either pushes both legs or nothing.
      
      The same translation method is also called by Data Source V2.
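
A toy model of the "push both legs or nothing" rule (the `SimpleFilter` type and `translate` method below are hypothetical stand-ins for Spark's `sources.Filter` hierarchy and the JDBC translation method, not the real API):

```scala
sealed trait SimpleFilter
case class Pred(sql: String) extends SimpleFilter // empty sql = untranslatable
case class And(left: SimpleFilter, right: SimpleFilter) extends SimpleFilter
case class Or(left: SimpleFilter, right: SimpleFilter) extends SimpleFilter

def translate(f: SimpleFilter): Option[String] = f match {
  case Pred(sql) if sql.nonEmpty => Some(sql)
  case Pred(_)                   => None // the "p2" that cannot be pushed
  case And(l, r) =>
    // Old behavior returned the translatable leg alone, which silently
    // widens the filter when the AND sits under an OR; require both legs.
    for (ls <- translate(l); rs <- translate(r)) yield s"($ls AND $rs)"
  case Or(l, r) =>
    for (ls <- translate(l); rs <- translate(r)) yield s"($ls OR $rs)"
}

// translate(Or(And(Pred("p1"), Pred("")), Pred("p3"))) == None,
// correctly refusing to emit the too-wide "(p1 OR p3)".
```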
      
      ## How was this patch tested?
      
      Added new unit test cases to JDBCSuite
      
      gatorsmile
      
      Author: Jia Li <jiali@us.ibm.com>
      
      Closes #19776 from jliwork/spark-22548.
      
      (cherry picked from commit 881c5c80)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast · 11a599ba
      Kazuaki Ishizaki authored
      
This PR changes `cast` code generation to place the generated code for the expressions of a struct's fields into separate methods when the total size could be large.
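
The splitting idea behind this and the sibling 64KB fixes (`elt`, `concat`, `concat_ws`, and `AND`/`OR` below) can be sketched generically. The helper here is illustrative only, not Spark's `CodegenContext.splitExpressions` API, and it uses source length as a crude stand-in for bytecode size:

```scala
// Group per-field code snippets into chunks and emit each chunk as its own
// private method, so no single generated method approaches the 64KB limit.
def splitIntoMethods(snippets: Seq[String], maxChunkChars: Int = 1024): String = {
  val chunks = snippets.foldLeft(Vector(Vector.empty[String])) { (acc, s) =>
    if (acc.last.nonEmpty && (acc.last :+ s).map(_.length).sum > maxChunkChars)
      acc :+ Vector(s)                  // start a new chunk
    else acc.init :+ (acc.last :+ s)    // append to the current chunk
  }
  val methods = chunks.zipWithIndex.map { case (body, i) =>
    s"private void apply_$i(InternalRow i) {\n${body.mkString("\n")}\n}"
  }
  val calls = chunks.indices.map(i => s"apply_$i(i);").mkString("\n")
  calls + "\n\n" + methods.mkString("\n\n")
}
```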
      
      Added new test cases into `CastSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19730 from kiszk/SPARK-22500.
      
      (cherry picked from commit ac10171b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt · 94f9227d
      Kazuaki Ishizaki authored
      
This PR changes `elt` code generation to place the generated code for the argument expressions into separate methods when the total size could be large.
This resolves the case of `elt` with a large number of arguments.
      
      Added new test cases into `StringExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19778 from kiszk/SPARK-22550.
      
      (cherry picked from commit 9bdff0bc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      [SPARK-22508][SQL] Fix 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() · 23eb4d70
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
This PR changes `GenerateUnsafeRowJoiner.create()` code generation to place the generated statements that operate on the bitmap and offsets into separate methods when the total size could be large.
      
      ## How was this patch tested?
      
      Added a new test case into `GenerateUnsafeRowJoinerSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19737 from kiszk/SPARK-22508.
      
      (cherry picked from commit c9577148)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  4. Nov 20, 2017
      [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws · ca025751
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
This PR changes `concat_ws` code generation to place the generated code for the argument expressions into separate methods when the total size could be large.
This resolves the case of `concat_ws` with a large number of arguments.
      
      ## How was this patch tested?
      
      Added new test cases into `StringExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19777 from kiszk/SPARK-22549.
      
      (cherry picked from commit 41c6f360)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  5. Nov 18, 2017
      [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat · 710d618f
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
This PR changes `concat` code generation to place the generated code for the argument expressions into separate methods when the total size could be large.
This resolves the case of `concat` with a large number of arguments.
      
      ## How was this patch tested?
      
      Added new test cases into `StringExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19728 from kiszk/SPARK-22498.
      
      (cherry picked from commit d54bfec2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
6. Nov 15, 2017
      [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder · 3cefddee
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
In the PySpark API documentation, [SparkSession.builder](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) is not documented and only shows a default value description.
      ```
      SparkSession.builder = <pyspark.sql.session.Builder object ...
      ```
      
      This PR adds the doc.
      
![screen](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png)
      
      The following is the diff of the generated result.
      
      ```
      $ diff old.html new.html
      95a96,101
      > <dl class="attribute">
      > <dt id="pyspark.sql.SparkSession.builder">
      > <code class="descname">builder</code><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
      > <dd><p>A class attribute having a <a class="reference internal" href="#pyspark.sql.SparkSession.Builder" title="pyspark.sql.SparkSession.Builder"><code class="xref py py-class docutils literal"><span class="pre">Builder</span></code></a> to construct <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> instances</p>
      > </dd></dl>
      >
      212,216d217
      < <dt id="pyspark.sql.SparkSession.builder">
      < <code class="descname">builder</code><em class="property"> = &lt;pyspark.sql.session.SparkSession.Builder object&gt;</em><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
      < <dd></dd></dl>
      <
      < <dl class="attribute">
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      ```
      cd python/docs
      make html
      open _build/html/pyspark.sql.html
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19726 from dongjoon-hyun/SPARK-22490.
      
      (cherry picked from commit aa88b8db)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
7. Nov 12, 2017
      [SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field... · f7363779
      Liang-Chi Hsieh authored
      [SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field names for special characters
      
      ## What changes were proposed in this pull request?
      
For a class with field names containing special characters, e.g.:
      ```scala
      case class MyType(`field.1`: String, `field 2`: String)
      ```
      
      Although we can manipulate DataFrame/Dataset, the field names are encoded:
      ```scala
      scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF
      df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string]
      scala> df.as[MyType].collect
      res7: Array[MyType] = Array(MyType(a,b), MyType(c,d))
      ```
      
It causes a resolution problem when we try to convert the data with non-encoded field names:
      ```scala
      spark.read.json(path).as[MyType]
      ...
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, field.1];
      [info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      ...
      ```
      
We should use the decoded field names in the Dataset schema.
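
For reference, Scala's `NameTransformer` can reverse this encoding; decoding of this kind is what the schema field names need (a small illustration, not the exact patched code):

```scala
import scala.reflect.NameTransformer

NameTransformer.decode("field$u002E1") // "field.1"
NameTransformer.decode("field$u00202") // "field 2"
```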
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19734 from viirya/SPARK-22442-2.2.
      [SPARK-21694][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build · 8acd02f4
      hyukjinkwon authored
      
This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 minutes on my Mac and over 20 minutes on Windows.
      
The root cause appears to be that it triggers roughly 2,500 jobs with the default 100 max iterations. On Linux, `daemon.R` is forked, but on Windows a new process is launched each time, which is extremely slow.

So, from my observation, the many launched (rather than forked) processes on Windows account for the difference in elapsed time.
      
After reducing the max iterations to 10, the total number of jobs in this single test is reduced to roughly 550.
      
After reducing the max iterations to 5, the total number of jobs in this single test is reduced to roughly 360.
      
      Manually tested the elapsed times.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19722 from HyukjinKwon/SPARK-21693-test.
      
      (cherry picked from commit 3d90b2cb)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
      [SPARK-19606][BUILD][BACKPORT-2.2][MESOS] fix mesos break · 2a04cfaa
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
Fix the build break from the cherry-pick.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #19732 from felixcheung/fixmesosdriverconstraint.
      [SPARK-22464][BACKPORT-2.2][SQL] No pushdown for Hive metastore partition... · 95981faa
      gatorsmile authored
      [SPARK-22464][BACKPORT-2.2][SQL] No pushdown for Hive metastore partition predicates containing null-safe equality
      
      ## What changes were proposed in this pull request?
`<=>` is not supported by Hive metastore partition predicate pushdown. We should not push it down to the Hive metastore when it is used in partition predicates.
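
As a toy illustration (the `Expr` type is hypothetical, not Catalyst's): the conversion to a metastore filter string must bail out on any `<=>` node. Note that for partition pruning, unlike data source filters, keeping a single leg of an `AND` is safe because the pruned result is a superset that Spark filters again:

```scala
sealed trait Expr
case class EqualTo(attr: String, value: String) extends Expr
case class EqualNullSafe(attr: String, value: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr

def toMetastoreFilter(e: Expr): Option[String] = e match {
  case EqualTo(a, v)       => Some(a + " = \"" + v + "\"")
  case EqualNullSafe(_, _) => None // the metastore cannot evaluate <=>
  case And(l, r) =>
    (toMetastoreFilter(l), toMetastoreFilter(r)) match {
      case (Some(ls), Some(rs)) => Some(s"$ls and $rs")
      case (Some(ls), None)     => Some(ls) // a superset is safe for pruning
      case (None, Some(rs))     => Some(rs)
      case _                    => None
    }
}
```
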
      
      ## How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19724 from gatorsmile/backportSPARK-22464.
      [SPARK-22488][BACKPORT-2.2][SQL] Fix the view resolution issue in the... · 00cb9d0b
      gatorsmile authored
      [SPARK-22488][BACKPORT-2.2][SQL] Fix the view resolution issue in the SparkSession internal table() API
      
      ## What changes were proposed in this pull request?
      
The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls the `sessionState.catalog.lookupRelation` API. This skips the view resolution logic in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands and by public and internal APIs.
      
Users might get a strange error caused by view resolution when the default database is different.
      ```
      Table or view not found: t1; line 1 pos 14
      org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
      	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      ```
      
This PR fixes it by enforcing the use of `ResolveRelations` to resolve the table.
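
A hedged sketch of the direction of the fix: route the lookup through the analyzer, whose `ResolveRelations` rule honors the database recorded in the view's metadata, instead of calling the catalog directly. Names approximate Spark's internals and may not match branch-2.2 exactly:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Run the unresolved relation through the analyzer rule chain rather than
// calling sessionState.catalog.lookupRelation directly.
def resolvedTablePlan(spark: SparkSession, name: String): LogicalPlan =
  spark.sessionState.analyzer.execute(UnresolvedRelation(TableIdentifier(name)))
```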
      
      ## How was this patch tested?
      Added a test case and modified the existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19723 from gatorsmile/backport22488.
      [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR · 114dc424
      Kazuaki Ishizaki authored
      
This PR changes `AND`/`OR` code generation to place the condition and the expressions' generated code into separate methods when their size could be large. When such a method is generated, the variables for `isNull` and `value` are declared as instance variables so that these values (e.g. `isNull1409` and `value1409`) can be passed back to the callers of the generated method.
      
      This PR resolved two cases:
      
      * large code size of left expression
      * large code size of right expression
      
      Added a new test case into `CodeGenerationSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18972 from kiszk/SPARK-21720.
      
      (cherry picked from commit 9bf696db)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      [SPARK-19606][MESOS] Support constraints in spark-dispatcher · f6ee3d90
      Paul Mackles authored
As discussed in SPARK-19606, this adds a new config property named `spark.mesos.constraints.driver` for constraining drivers running on a Mesos cluster.
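
A hedged usage sketch; the constraint value is invented, and the syntax is assumed to match the existing `spark.mesos.constraints` property. In practice the property would be passed to `spark-submit` against the dispatcher rather than set in code:

```scala
import org.apache.spark.SparkConf

// Equivalent of: spark-submit --conf spark.mesos.constraints.driver=rack:dc1 ...
val conf = new SparkConf()
  .set("spark.mesos.constraints.driver", "rack:dc1") // "rack:dc1" is illustrative
```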
      
A corresponding unit test was added; also tested locally on a Mesos cluster.
      
      
      Author: Paul Mackles <pmackles@adobe.com>
      
      Closes #19543 from pmackles/SPARK-19606.
      
      (cherry picked from commit b3f9dbf4)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
8. Nov 10, 2017
      [SPARK-21667][STREAMING] ConsoleSink should not fail streaming query with checkpointLocation option · 4ef0bef9
      Rekha Joshi authored
      
      ## What changes were proposed in this pull request?
Fix to allow recovery with the console sink, avoiding the checkpoint exception.
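
A minimal sketch of the scenario the fix targets, assuming a socket source and a local checkpoint path (both invented here): a console sink started with `checkpointLocation`, which previously failed when recovering from the checkpoint.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("console-ckpt").getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// Before the fix, restarting this query failed because the console sink
// rejected the recovered checkpoint.
val query = lines.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/console-ckpt")
  .start()
```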
      
      ## How was this patch tested?
Existing tests.
Manual tests (replicating the error and confirming no checkpoint error after the fix).
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: rjoshi2 <rekhajoshm@gmail.com>
      
      Closes #19407 from rekhajoshm/SPARK-21667.
      
      (cherry picked from commit 808e886b)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
      [SPARK-19644][SQL] Clean up Scala reflection garbage after creating Encoder (branch-2.2) · 8b7f72ed
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
Backport #19687 to branch-2.2. The major difference is that `cleanUpReflectionObjects` is protected by `ScalaReflectionLock.synchronized` in this PR for Scala 2.10.
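
A hedged sketch of the cleanup pattern, close to but not verbatim the Spark code: run the reflection work inside the universe's `undoLog` so intermediate symbols are rolled back, guarded (for Scala 2.10 on branch-2.2) by the reflection lock. `ScalaReflectionLock` below is a stand-in for Spark's internal lock object:

```scala
import scala.reflect.runtime.{universe => ru}

object ScalaReflectionLock

def cleanUpReflectionObjects[T](func: => T): T =
  ScalaReflectionLock.synchronized {
    // undoLog.undo runs `func` and rolls back reflection artifacts created
    // while it executed, preventing the garbage from accumulating.
    ru.asInstanceOf[scala.reflect.runtime.JavaUniverse].undoLog.undo(func)
  }
```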
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #19718 from zsxwing/SPARK-19644-2.2.
      [SPARK-22284][SQL] Fix 64KB JVM bytecode limit problem in calculating hash for nested structs · 6b4ec22e
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
This PR avoids generating a huge method for calculating a murmur3 hash for nested structs, by splitting the huge method (e.g. `apply_4`) into multiple smaller methods.
      
      Sample program
      ```
        val structOfString = new StructType().add("str", StringType)
        var inner = new StructType()
        for (_ <- 0 until 800) {
    inner = inner.add("structOfString", structOfString)
        }
        var schema = new StructType()
        for (_ <- 0 until 50) {
          schema = schema.add("structOfStructOfStrings", inner)
        }
        GenerateMutableProjection.generate(Seq(Murmur3Hash(exprs, 42)))
      ```
      
      Without this PR
      ```
      /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
      /* 006 */
      /* 007 */   private Object[] references;
      /* 008 */   private InternalRow mutableRow;
      /* 009 */   private int value;
      /* 010 */   private int value_0;
      ...
      /* 034 */   public java.lang.Object apply(java.lang.Object _i) {
      /* 035 */     InternalRow i = (InternalRow) _i;
      /* 036 */
      /* 037 */
      /* 038 */
      /* 039 */     value = 42;
      /* 040 */     apply_0(i);
      /* 041 */     apply_1(i);
      /* 042 */     apply_2(i);
      /* 043 */     apply_3(i);
      /* 044 */     apply_4(i);
      /* 045 */     nestedClassInstance.apply_5(i);
      ...
      /* 089 */     nestedClassInstance8.apply_49(i);
      /* 090 */     value_0 = value;
      /* 091 */
      /* 092 */     // copy all the results into MutableRow
      /* 093 */     mutableRow.setInt(0, value_0);
      /* 094 */     return mutableRow;
      /* 095 */   }
      /* 096 */
      /* 097 */
      /* 098 */   private void apply_4(InternalRow i) {
      /* 099 */
      /* 100 */     boolean isNull5 = i.isNullAt(4);
      /* 101 */     InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800));
      /* 102 */     if (!isNull5) {
      /* 103 */
      /* 104 */       if (!value5.isNullAt(0)) {
      /* 105 */
      /* 106 */         final InternalRow element6400 = value5.getStruct(0, 1);
      /* 107 */
      /* 108 */         if (!element6400.isNullAt(0)) {
      /* 109 */
      /* 110 */           final UTF8String element6401 = element6400.getUTF8String(0);
      /* 111 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value);
      /* 112 */
      /* 113 */         }
      /* 114 */
      /* 115 */
      /* 116 */       }
      /* 117 */
      /* 118 */
      /* 119 */       if (!value5.isNullAt(1)) {
      /* 120 */
      /* 121 */         final InternalRow element6402 = value5.getStruct(1, 1);
      /* 122 */
      /* 123 */         if (!element6402.isNullAt(0)) {
      /* 124 */
      /* 125 */           final UTF8String element6403 = element6402.getUTF8String(0);
      /* 126 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value);
      /* 127 */
      /* 128 */         }
      /* 129 */
      /* 130 */
      /* 131 */       }
      /* 132 */
      /* 133 */
      /* 134 */       if (!value5.isNullAt(2)) {
      /* 135 */
      /* 136 */         final InternalRow element6404 = value5.getStruct(2, 1);
      /* 137 */
      /* 138 */         if (!element6404.isNullAt(0)) {
      /* 139 */
      /* 140 */           final UTF8String element6405 = element6404.getUTF8String(0);
      /* 141 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value);
      /* 142 */
      /* 143 */         }
      /* 144 */
      /* 145 */
      /* 146 */       }
      /* 147 */
      ...
      /* 12074 */       if (!value5.isNullAt(798)) {
      /* 12075 */
      /* 12076 */         final InternalRow element7996 = value5.getStruct(798, 1);
      /* 12077 */
      /* 12078 */         if (!element7996.isNullAt(0)) {
      /* 12079 */
      /* 12080 */           final UTF8String element7997 = element7996.getUTF8String(0);
      /* 12083 */         }
      /* 12084 */
      /* 12085 */
      /* 12086 */       }
      /* 12087 */
      /* 12088 */
      /* 12089 */       if (!value5.isNullAt(799)) {
      /* 12090 */
      /* 12091 */         final InternalRow element7998 = value5.getStruct(799, 1);
      /* 12092 */
      /* 12093 */         if (!element7998.isNullAt(0)) {
      /* 12094 */
      /* 12095 */           final UTF8String element7999 = element7998.getUTF8String(0);
      /* 12096 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element7999.getBaseObject(), element7999.getBaseOffset(), element7999.numBytes(), value);
      /* 12097 */
      /* 12098 */         }
      /* 12099 */
      /* 12100 */
      /* 12101 */       }
      /* 12102 */
      /* 12103 */     }
      /* 12104 */
      /* 12105 */   }
      /* 12106 */
      /* 12107 */
      /* 12108 */   private void apply_1(InternalRow i) {
      ...
      ```
      
      With this PR
      ```
      /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
      /* 006 */
      /* 007 */   private Object[] references;
      /* 008 */   private InternalRow mutableRow;
      /* 009 */   private int value;
      /* 010 */   private int value_0;
      /* 011 */
      ...
      /* 034 */   public java.lang.Object apply(java.lang.Object _i) {
      /* 035 */     InternalRow i = (InternalRow) _i;
      /* 036 */
      /* 037 */
      /* 038 */
      /* 039 */     value = 42;
      /* 040 */     nestedClassInstance11.apply50_0(i);
      /* 041 */     nestedClassInstance11.apply50_1(i);
      ...
      /* 088 */     nestedClassInstance11.apply50_48(i);
      /* 089 */     nestedClassInstance11.apply50_49(i);
      /* 090 */     value_0 = value;
      /* 091 */
      /* 092 */     // copy all the results into MutableRow
      /* 093 */     mutableRow.setInt(0, value_0);
      /* 094 */     return mutableRow;
      /* 095 */   }
      /* 096 */
      ...
      /* 37717 */   private void apply4_0(InternalRow value5, InternalRow i) {
      /* 37718 */
      /* 37719 */     if (!value5.isNullAt(0)) {
      /* 37720 */
      /* 37721 */       final InternalRow element6400 = value5.getStruct(0, 1);
      /* 37722 */
      /* 37723 */       if (!element6400.isNullAt(0)) {
      /* 37724 */
      /* 37725 */         final UTF8String element6401 = element6400.getUTF8String(0);
      /* 37726 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value);
      /* 37727 */
      /* 37728 */       }
      /* 37729 */
      /* 37730 */
      /* 37731 */     }
      /* 37732 */
      /* 37733 */     if (!value5.isNullAt(1)) {
      /* 37734 */
      /* 37735 */       final InternalRow element6402 = value5.getStruct(1, 1);
      /* 37736 */
      /* 37737 */       if (!element6402.isNullAt(0)) {
      /* 37738 */
      /* 37739 */         final UTF8String element6403 = element6402.getUTF8String(0);
      /* 37740 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value);
      /* 37741 */
      /* 37742 */       }
      /* 37743 */
      /* 37744 */
      /* 37745 */     }
      /* 37746 */
      /* 37747 */     if (!value5.isNullAt(2)) {
      /* 37748 */
      /* 37749 */       final InternalRow element6404 = value5.getStruct(2, 1);
      /* 37750 */
      /* 37751 */       if (!element6404.isNullAt(0)) {
      /* 37752 */
      /* 37753 */         final UTF8String element6405 = element6404.getUTF8String(0);
      /* 37754 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value);
      /* 37755 */
      /* 37756 */       }
      /* 37757 */
      /* 37758 */
      /* 37759 */     }
      /* 37760 */
      /* 37761 */   }
      ...
      /* 218470 */
      /* 218471 */     private void apply50_4(InternalRow i) {
      /* 218472 */
      /* 218473 */       boolean isNull5 = i.isNullAt(4);
      /* 218474 */       InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800));
      /* 218475 */       if (!isNull5) {
      /* 218476 */         apply4_0(value5, i);
      /* 218477 */         apply4_1(value5, i);
      /* 218478 */         apply4_2(value5, i);
      ...
      /* 218742 */         nestedClassInstance.apply4_266(value5, i);
      /* 218743 */       }
      /* 218744 */
      /* 218745 */     }
      ```
      
      ## How was this patch tested?
      
      Added new test to `HashExpressionsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19563 from kiszk/SPARK-22284.
      
      (cherry picked from commit f2da738c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>