  1. Sep 01, 2016
    • Sean Owen's avatar
      [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays · 3893e8c5
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Avoid allocating some 0-length arrays, especially in UTF8String, and prefer `Array.empty` in Scala over `Array[T]()`
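
      A minimal Scala sketch of the idea (illustrative only; the helper below is hypothetical and not the actual UTF8String code):

      ```scala
      object EmptyArrayExample {
        // Style preferred by this change: Array.empty over Array[T]() for empty literals.
        val noInts: Array[Int] = Array.empty[Int]

        // Hypothetical helper in the spirit of the UTF8String fix: reuse one shared
        // zero-length array instead of allocating a new one for every empty result.
        private val EMPTY_BYTES = new Array[Byte](0)

        def prefix(src: Array[Byte], len: Int): Array[Byte] =
          if (len == 0) EMPTY_BYTES
          else java.util.Arrays.copyOf(src, len)
      }
      ```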
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14895 from srowen/SPARK-17331.
      3893e8c5
    • Herman van Hovell's avatar
      [SPARK-17263][SQL] Add hexadecimal literal parsing · 2be5f8d7
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds the ability to parse SQL (hexadecimal) binary literals (AKA bit strings). It follows the syntax `X'[Hexadecimal Characters]+'`; for example, `X'01AB'` would create the following binary array: `0x01AB`.
      
      If an uneven number of hexadecimal characters is passed, then the upper 4 bits of the initial byte are kept empty, and the lower 4 bits are filled using the first character. For example `X'1C7'` would create the following binary array `0x01C7`.
      
      Binary data (`Array[Byte]`) does not have proper `hashCode` and `equals` functions. This meant that comparing `Literal`s containing binary data was a pain. I have updated `Literal.hashCode` and `Literal.equals` to deal properly with binary data.
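
      As an illustration of the odd-length padding described above, here is a small Scala sketch (the helper name is hypothetical; this is not the parser code itself):

      ```scala
      // Convert the characters of a hex literal to bytes. An odd-length string is
      // left-padded with a zero nibble, so "1C7" becomes the bytes 0x01, 0xC7.
      def hexLiteralToBytes(hex: String): Array[Byte] = {
        val padded = if (hex.length % 2 == 0) hex else "0" + hex
        padded.grouped(2).map(Integer.parseInt(_, 16).toByte).toArray
      }

      // hexLiteralToBytes("01AB") => Array(0x01, 0xAB)
      // hexLiteralToBytes("1C7")  => Array(0x01, 0xC7)
      ```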
      
      ## How was this patch tested?
      Added tests to the `ExpressionParserSuite`, `SQLQueryTestSuite` and `ExpressionSQLBuilderSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #14832 from hvanhovell/SPARK-17263.
      2be5f8d7
    • Angus Gerry's avatar
      [SPARK-16533][CORE] resolve deadlocking in driver when executors die · a0aac4b7
      Angus Gerry authored
      ## What changes were proposed in this pull request?
      This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
      * Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent is that if invocations are delayed beyond the default schedule interval on account of lock contention, we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks and further exacerbating contention. (A minimal sketch of this scheduling behavior follows this list.)
      * Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.
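
      As referenced in the first point above, here is a generic sketch using plain `java.util.concurrent` (not the Spark code itself): `scheduleWithFixedDelay` measures the interval from the end of one run to the start of the next, so a run that is held up by lock contention is not immediately followed by a back-to-back invocation the way a fixed-rate schedule would catch up.

      ```scala
      import java.util.concurrent.{Executors, TimeUnit}

      val pool = Executors.newSingleThreadScheduledExecutor()

      // The 100 ms delay is counted from the *end* of each run, so a slow run does
      // not cause queued-up, back-to-back invocations of the task.
      pool.scheduleWithFixedDelay(
        new Runnable { def run(): Unit = println("ExecutorAllocationManager.schedule()") },
        0L, 100L, TimeUnit.MILLISECONDS)
      ```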
      
      ## How was this patch tested?
      Existing tests, and manual tests under yarn-client mode.
      
      Author: Angus Gerry <angolon@gmail.com>
      
      Closes #14710 from angolon/SPARK-16533.
      a0aac4b7
    • Tejas Patil's avatar
      [SPARK-17271][SQL] Remove redundant `semanticEquals()` from `SortOrder` · adaaffa3
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Removing `semanticEquals()` from `SortOrder` because it can use the `semanticEquals()` provided by its parent class (`Expression`). This was as per suggestion by cloud-fan at https://github.com/apache/spark/pull/14841/files/7192418b3a26a14642fc04fc92bf496a954ffa5d#r77106801
      
      ## How was this patch tested?
      
      Ran the test added in https://github.com/apache/spark/pull/14841
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #14910 from tejasapatil/SPARK-17271_remove_semantic_ordering.
      adaaffa3
    • Wenchen Fan's avatar
      [SPARK-17257][SQL] the physical plan of CREATE TABLE or CTAS should take CatalogTable · 8e740ae4
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is kind of a follow-up of https://github.com/apache/spark/pull/14482 . As we put `CatalogTable` in the logical plan directly, it makes sense to let physical plans take `CatalogTable` directly, instead of extracting some fields of `CatalogTable` in the planner and then constructing a new `CatalogTable` in the physical plan.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14823 from cloud-fan/create-table.
      8e740ae4
    • gatorsmile's avatar
      [SPARK-17353][SPARK-16943][SPARK-16942][SQL] Fix multiple bugs in CREATE TABLE LIKE command · 1f06a5b6
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The existing `CREATE TABLE LIKE` command has multiple issues:
      
      - The generated table is non-empty when the source table is a data source table. The major reason is that the data source table uses the table property `path` to store the location of the table contents. Currently, we keep it unchanged, so we still create the same table pointing at the same location.
      
      - The table type of the generated table is `EXTERNAL` when the source table is an external Hive SerDe table. Currently, we explicitly set it to `MANAGED`, but Hive checks the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not (see https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408). Thus, the created table is still `EXTERNAL`.
      
      - When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far, this does not break anything, but it could cause something wrong in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)
      
      - The table `comment` is not handled correctly. To follow what Hive does, the table comment should be cleared, but the column comments should still be kept.
      
      - The `INDEX` table is not supported. Thus, we should throw an exception in this case.
      
      - `owner` should not be retained. `toHiveTable` sets it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string to avoid confusing output in EXPLAIN.
      
      - Add support for temporary tables.
      
      - Like Hive, we should not copy the table properties from the source table to the created table, especially for the statistics-related properties, which could be wrong in the created table.
      
      - `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features.
      
      - When the source table is a view, the target table uses the default data source format, `spark.sql.sources.default`.
      
      This PR is to fix the above issues.
      
      ### How was this patch tested?
      Improve the test coverage by adding more test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14531 from gatorsmile/createTableLike.
      1f06a5b6
    • Seigneurin, Alexis (CONT)'s avatar
      fixed typos · dd859f95
      Seigneurin, Alexis (CONT) authored
      fixed 2 typos
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14877 from aseigneurin/fix-typo-2.
      dd859f95
    • Sean Zhong's avatar
      [SPARK-16283][SQL] Implements percentile_approx aggregation function which... · a18c169f
      Sean Zhong authored
      [SPARK-16283][SQL] Implements percentile_approx aggregation function which supports partial aggregation.
      
      ## What changes were proposed in this pull request?
      
      This PR implements aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`.
      
      ### Syntax:
      ```
      # Returns percentile at a given percentage value. The approximation error can be reduced by increasing parameter accuracy, at the cost of memory.
      percentile_approx(col, percentage [, accuracy])
      
      # Returns percentile value array at given percentage value array
      percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
      ```
      
      ### Features:
      1. This function supports partial aggregation.
      2. The memory consumption is bounded. The larger the `accuracy` parameter we choose, the smaller the error we get. The default accuracy value is 10000, to match Hive's default setting. Choose a smaller value for a smaller memory footprint.
      3. This function supports window function aggregation.
      
      ### Example usages:
      ```
      ## Returns the 25th percentile value, with default accuracy
      SELECT percentile_approx(col, 0.25) FROM table
      
      ## Returns an array of percentile value (25th, 50th, 75th), with default accuracy
      SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table
      
      ## Returns 25th percentile value, with custom accuracy value 100, larger accuracy parameter yields smaller approximation error
      SELECT percentile_approx(col, 0.25, 100) FROM table
      
      ## Returns the 25th, and 50th percentile values, with custom accuracy value 100
      SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
      ```
      
      ### NOTE:
      1. The `percentile_approx` implementation is different from Hive's, so the result returned for the same query may be slightly different from Hive's. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows the paper `Space-efficient Online Computation of Quantile Summaries` by Greenwald, Michael and Khanna, Sanjeev (http://dx.doi.org/10.1145/375663.375670).
      2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal.
      
      ## How was this patch tested?
      
      Unit test, and Sql query test.
      
      ## Acknowledgement
      1. This PR's work is based on lw-lin's PR https://github.com/apache/spark/pull/14298, with improvements such as supporting partial aggregation and fixing an out-of-memory issue.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #14868 from clockfly/appro_percentile_try_2.
      a18c169f
    • Sean Owen's avatar
      [SPARK-17329][BUILD] Don't build PRs with -Pyarn unless YARN code changed · 536fa911
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Only build PRs with -Pyarn if YARN code was modified.
      
      ## How was this patch tested?
      
      Jenkins tests (will look to verify whether -Pyarn was included in the PR builder for this one.)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14892 from srowen/SPARK-17329.
      536fa911
    • Shixiong Zhu's avatar
      [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl again · 21c0a4fe
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      After digging into the logs, I noticed the failure is because this test starts a local cluster with 2 executors, but when SparkContext is created, the executors may not be up yet. If one of the executors is not up while the job runs, the blocks won't be replicated.
      
      This PR just adds a wait loop before running the job to fix the flaky test.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14905 from zsxwing/SPARK-17318-2.
      21c0a4fe
    • Wenchen Fan's avatar
      revert PR#10896 and PR#14865 · aaf632b2
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      According to the discussion in the original PR #10896 and the new approach PR #14876, we decided to revert these 2 PRs and go with the new approach.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14909 from cloud-fan/revert.
      aaf632b2
  2. Aug 31, 2016
    • Xin Ren's avatar
      [SPARK-17241][SPARKR][MLLIB] SparkR spark.glm should have configurable regularization parameter · 7a5000f3
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-17241
      
      ## What changes were proposed in this pull request?
      
      Spark has a configurable L2 regularization parameter for generalized linear regression. It is very important to have it in SparkR so that users can run ridge regression.
      
      ## How was this patch tested?
      
      Tested manually on a local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14856 from keypointt/SPARK-17241.
      7a5000f3
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix windowPartitionBy example · d008638f
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      The usage in the original example is incorrect. This PR fixes it.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14903 from junyangq/SPARKR-FixWindowPartitionByDoc.
      d008638f
    • Shivaram Venkataraman's avatar
      [SPARK-16581][SPARKR] Fix JVM API tests in SparkR · 2f9c2736
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      Remove cleanup.jobj test. Use JVM wrapper API for other test cases.
      
      ## How was this patch tested?
      
      Run R unit tests with testthat 1.0
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14904 from shivaram/sparkr-jvm-tests-fix.
      2f9c2736
    • Shixiong Zhu's avatar
      [SPARK-17316][TESTS] Fix MesosCoarseGrainedSchedulerBackendSuite · d375c8a3
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      The master is broken because #14882 didn't run mesos tests.
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14902 from zsxwing/hotfix.
      d375c8a3
    • hyukjinkwon's avatar
      [SPARK-17326][SPARKR] Fix tests with HiveContext in SparkR not to be skipped always · 50bb1423
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `HiveContext` in SparkR is not being tested and is always skipped.
      This is because the initialization of `TestHiveContext` fails while trying to load non-existent data paths (test tables).
      
      This is introduced from https://github.com/apache/spark/pull/14005
      
      This enables the tests with SparkR.
      
      ## How was this patch tested?
      
      Manually,
      
      **Before** (on Mac OS)
      
      ```
      ...
      Skipped ------------------------------------------------------------------------
      1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
      2. test HiveContext (test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped
      3. read/write ORC files (test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped
      4. enableHiveSupport on SparkSession (test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped
      5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
      ...
      ```
      
      **After** (on Mac OS)
      
      ```
      ...
      Skipped ------------------------------------------------------------------------
      1. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
      ...
      ```
      
      Please refer to the test runs below (on Windows):
       - Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123
       - After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14889 from HyukjinKwon/SPARK-17326.
      50bb1423
    • Sean Owen's avatar
      [SPARK-17332][CORE] Make Java Loggers static members · 5d84c7fd
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make all Java Loggers static members
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14896 from srowen/SPARK-17332.
      5d84c7fd
    • Shixiong Zhu's avatar
      [SPARK-17316][CORE] Make CoarseGrainedSchedulerBackend.removeExecutor non-blocking · 9bcb33c5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint.
      
      This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value.
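
      A generic Scala sketch of the difference (assumed names; not the actual RPC code): the blocking variant parks the caller, and whatever locks it holds, until the reply arrives, while the non-blocking variant just fires the request and returns.

      ```scala
      import scala.concurrent.{Await, ExecutionContext, Future}
      import scala.concurrent.duration._
      import ExecutionContext.Implicits.global

      // Stand-in for an RPC ask to a scheduler endpoint.
      def askEndpoint(msg: String): Future[Boolean] = Future { true }

      // Blocking: the caller waits (possibly while holding locks) for the reply.
      def removeExecutorBlocking(id: String): Boolean =
        Await.result(askEndpoint(s"RemoveExecutor($id)"), 10.seconds)

      // Non-blocking: send the message and move on; the reply is handled asynchronously.
      def removeExecutorNonBlocking(id: String): Unit =
        askEndpoint(s"RemoveExecutor($id)").failed.foreach(e => println(s"remove failed: $e"))
      ```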
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14882 from zsxwing/SPARK-17316.
      9bcb33c5
    • Michael Gummelt's avatar
      [SPARK-17320] add build_profile_flags entry to mesos build module · 0611b3a2
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      add build_profile_flags entry to mesos build module
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14885 from mgummelt/mesos-profile.
      0611b3a2
    • hyukjinkwon's avatar
      [MINOR][SPARKR] Verbose build comment in WINDOWS.md rather than promoting... · 9953442a
      hyukjinkwon authored
      [MINOR][SPARKR] Verbose build comment in WINDOWS.md rather than promoting default build without Hive
      
      ## What changes were proposed in this pull request?
      
      This PR fixes `WINDOWS.md` to point readers to the other build profiles described at http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn rather than directly telling them to run `mvn -DskipTests -Psparkr package` without Hive support.
      
      ## How was this patch tested?
      
      Manually,
      
      <img width="626" alt="2016-08-31 6 01 08" src="https://cloud.githubusercontent.com/assets/6477701/18122549/f6297b2c-6fa4-11e6-9b5e-fd4347355d87.png">
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14890 from HyukjinKwon/minor-build-r.
      9953442a
    • Wenchen Fan's avatar
      [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create AlterViewAsCommand to handle ALTER VIEW AS · 12fd0cd6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently we use `CreateViewCommand` to implement ALTER VIEW AS, which has 3 bugs:
      
      1. SPARK-17180: ALTER VIEW AS should alter temp view if view name has no database part and temp view exists
      2. SPARK-17309: ALTER VIEW AS should issue exception if view does not exist.
      3. SPARK-17323: ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.
      
      The root cause is that ALTER VIEW AS is quite different from CREATE VIEW, so we need different code paths to handle them. However, in `CreateViewCommand` there is no way to distinguish ALTER VIEW AS from CREATE VIEW without introducing an extra flag. Instead of doing that, I think a more natural way is to separate the ALTER VIEW AS logic into a new command.
      
      ## How was this patch tested?
      
      new tests in SQLViewSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14874 from cloud-fan/minor4.
      12fd0cd6
    • Jeff Zhang's avatar
      [SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf · fa634793
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Allow users to set the sparkr shell command through `--conf spark.r.shell.command`
      
      ## How was this patch tested?
      
      Unit test is added and also verify it manually through
      ```
      bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14744 from zjffdu/SPARK-17178.
      fa634793
  3. Aug 30, 2016
    • Kazuaki Ishizaki's avatar
      [SPARK-15985][SQL] Eliminate redundant cast from an array without null or a map without null · d92cd227
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR eliminates redundant cast from an `ArrayType` with `containsNull = false` or a `MapType` with `containsNull = false`.
      
      For example, in the `ArrayType` case, the current implementation leaves a cast `cast(value#63 as array<double>).toDoubleArray`. However, we can eliminate `cast(value#63 as array<double>)` if we know `value#63` does not include `null`. This PR applies this elimination for `ArrayType` and `MapType` in `SimplifyCasts` at the plan optimization phase.
      
      In summary, we got 1.2-1.3x performance improvements over the code before applying this PR.
      Here are performance results of benchmark programs:
      ```
        test("Read array in Dataset") {
          import sparkSession.implicits._
      
          val iters = 5
          val n = 1024 * 1024
          val rows = 15
      
          val benchmark = new Benchmark("Read primnitive array", n)
      
          val rand = new Random(511)
          val intDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
            .map(i => Array.tabulate(n)(i => i)).toDS()
          intDS.count() // force to create ds
          val lastElement = n - 1
          val randElement = rand.nextInt(lastElement)
      
          benchmark.addCase(s"Read int array in Dataset", numIters = iters)(iter => {
            val idx0 = randElement
            val idx1 = lastElement
            intDS.map(a => a(0) + a(idx0) + a(idx1)).collect
          })
      
          val doubleDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
            .map(i => Array.tabulate(n)(i => i.toDouble)).toDS()
          doubleDS.count() // force to create ds
      
          benchmark.addCase(s"Read double array in Dataset", numIters = iters)(iter => {
            val idx0 = randElement
            val idx1 = lastElement
            doubleDS.map(a => a(0) + a(idx0) + a(idx1)).collect
          })
      
          benchmark.run()
        }
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4
      Intel(R) Core(TM) i5-5257U CPU  2.70GHz
      
      without this PR
      Read primnitive array:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Read int array in Dataset                      525 /  690          2.0         500.9       1.0X
      Read double array in Dataset                   947 / 1209          1.1         902.7       0.6X
      
      with this PR
      Read primnitive array:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Read int array in Dataset                      400 /  492          2.6         381.5       1.0X
      Read double array in Dataset                   788 /  870          1.3         751.4       0.5X
      ```
      
      An example program that originally caused this performance issue.
      ```
      val ds = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDS()
      val ds2 = ds.map(p => {
           var s = 0.0
           for (i <- 0 to 2) { s += p(i) }
           s
         })
      ds2.show
      ds2.explain(true)
      ```
      
      Plans before this PR
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [input[0, double, true] AS value#68]
      +- 'MapElements <function1>, obj#67: double
         +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#66: [D
            +- LocalRelation [value#63]
      
      == Analyzed Logical Plan ==
      value: double
      SerializeFromObject [input[0, double, true] AS value#68]
      +- MapElements <function1>, obj#67: double
         +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalRelation [value#63]
      
      == Optimized Logical Plan ==
      SerializeFromObject [input[0, double, true] AS value#68]
      +- MapElements <function1>, obj#67: double
         +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalRelation [value#63]
      
      == Physical Plan ==
      *SerializeFromObject [input[0, double, true] AS value#68]
      +- *MapElements <function1>, obj#67: double
         +- *DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalTableScan [value#63]
      ```
      
      Plans after this PR
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [input[0, double, true] AS value#6]
      +- 'MapElements <function1>, obj#5: double
         +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#4: [D
            +- LocalRelation [value#1]
      
      == Analyzed Logical Plan ==
      value: double
      SerializeFromObject [input[0, double, true] AS value#6]
      +- MapElements <function1>, obj#5: double
         +- DeserializeToObject cast(value#1 as array<double>).toDoubleArray, obj#4: [D
            +- LocalRelation [value#1]
      
      == Optimized Logical Plan ==
      SerializeFromObject [input[0, double, true] AS value#6]
      +- MapElements <function1>, obj#5: double
         +- DeserializeToObject value#1.toDoubleArray, obj#4: [D
            +- LocalRelation [value#1]
      
      == Physical Plan ==
      *SerializeFromObject [input[0, double, true] AS value#6]
      +- *MapElements <function1>, obj#5: double
         +- *DeserializeToObject value#1.toDoubleArray, obj#4: [D
            +- LocalTableScan [value#1]
      ```
      
      ## How was this patch tested?
      
      Tested by new test cases in `SimplifyCastsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #13704 from kiszk/SPARK-15985.
      d92cd227
    • Shixiong Zhu's avatar
      [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl · 231f9732
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      There are a lot of failures recently: http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl
      
      This PR just changed the persist level to `MEMORY_AND_DISK_2` to avoid blocks being evicted from memory.
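
      For reference, a minimal sketch of using that persist level through the standard RDD API (the job itself is made up):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.storage.StorageLevel

      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("persist-example"))

      // MEMORY_AND_DISK_2 keeps two replicas and can spill to disk, so a replicated
      // block is not simply lost when it gets evicted from memory.
      val data = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_2)
      data.count()
      ```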
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14884 from zsxwing/SPARK-17318.
      231f9732
    • Alex Bozarth's avatar
      [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history · f7beae6d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      With the new History Server, the summary page loads the application list via the REST API, which makes it very slow, or even impossible, to load with a large (10K+) application history. This PR fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` API. (Note this only applies to what the summary page displays; all the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)
      
      I've also added a new test for the `limit` param in `HistoryServerSuite.scala`
      
      ## How was this patch tested?
      
      Manual testing and dev/run-tests
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #14835 from ajbozarth/spark17243.
      f7beae6d
    • Shixiong Zhu's avatar
      [SPARK-17314][CORE] Use Netty's DefaultThreadFactory to enable its fast ThreadLocal impl · 02ac379e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When a thread is a Netty FastThreadLocalThread, Netty will use its fast ThreadLocal implementation, which has better performance than the JDK's (see the benchmark results in https://github.com/netty/netty/pull/4417; note that it is not a fix to Netty's FastThreadLocal, it just fixed an issue in Netty's benchmark code).
      
      This PR just changed the ThreadFactory to Netty's DefaultThreadFactory which will use FastThreadLocalThread. There is also a minor change to the thread names. See https://github.com/netty/netty/blob/netty-4.0.22.Final/common/src/main/java/io/netty/util/concurrent/DefaultThreadFactory.java#L94
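
      A minimal sketch of wiring Netty's `DefaultThreadFactory` into a JDK executor (pool name and size are assumptions, not the exact Spark change):

      ```scala
      import java.util.concurrent.Executors
      import io.netty.util.concurrent.DefaultThreadFactory

      // Threads created by this factory are FastThreadLocalThread instances, so
      // Netty's FastThreadLocal takes its fast, array-indexed path on them.
      val factory = new DefaultThreadFactory("example-rpc", true /* daemon */)
      val pool = Executors.newFixedThreadPool(4, factory)
      ```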
      
      ## How was this patch tested?
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14879 from zsxwing/netty-thread.
      02ac379e
    • Josh Rosen's avatar
      [SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted · fb200843
      Josh Rosen authored
      This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run
      
      ```
      sc.parallelize(1 to 100000, 100000).map(identity).count()
      ```
      
      then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png)
      
      When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:
      
      ```scala
         for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
            do {
              launchedTask = resourceOfferSingleTaskSet(
                  taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
            } while (launchedTask)
          }
      ```
      
      In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable. The problem with this placement is that the last call to `resourceOfferSingleTaskSet` in the `while` loop always returns `false`, which means `resourceOfferSingleTaskSet` ends up calling `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` triggers the `abortIfCompletelyBlacklisted` check for every task set.
      
      Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set in case none of the task set's tasks can be scheduled at any locality level.
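
      A self-contained toy version of that restructuring (all names and signatures below are stand-ins, not the real TaskSchedulerImpl code): the blacklist check runs at most once per task set, and only when nothing could be launched at any locality level.

      ```scala
      final case class ToyTaskSet(name: String, localityLevels: Seq[String])

      def resourceOffersToy(
          taskSets: Seq[ToyTaskSet],
          offerSingleTaskSet: (ToyTaskSet, String) => Boolean,
          abortIfCompletelyBlacklisted: ToyTaskSet => Unit): Unit = {
        for (taskSet <- taskSets) {
          var launchedAnyTask = false
          for (maxLocality <- taskSet.localityLevels) {
            var launchedTask = true
            while (launchedTask) {
              launchedTask = offerSingleTaskSet(taskSet, maxLocality)
              if (launchedTask) launchedAnyTask = true
            }
          }
          // Moved out of the inner loops: checked at most once per task set.
          if (!launchedAnyTask) abortIfCompletelyBlacklisted(taskSet)
        }
      }
      ```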
      
      Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change.
      
      /cc squito and kayousterhout for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14871 from JoshRosen/bail-early-if-no-cpus.
      fb200843
    • Ferdinand Xu's avatar
      [SPARK-5682][CORE] Add encrypted shuffle in spark · 4b4e329e
      Ferdinand Xu authored
      This patch uses the Apache Commons Crypto library to enable shuffle encryption support.
      
      Author: Ferdinand Xu <cheng.a.xu@intel.com>
      Author: kellyzly <kellyzly@126.com>
      
      Closes #8880 from winningsix/SPARK-10771.
      4b4e329e
    • Xin Ren's avatar
      [MINOR][MLLIB][SQL] Clean up unused variables and unused import · 27209252
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Clean up unused variables and unused import statements, unnecessary `return` and `toArray` calls, and make some more style improvements noticed while walking through the code examples.
      
      ## How was this patch tested?
      
      Tested manually on a local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14836 from keypointt/codeWalkThroughML.
      27209252
    • Dmitriy Sokolov's avatar
      [MINOR][DOCS] Fix minor typos in python example code · d4eee993
      Dmitriy Sokolov authored
      ## What changes were proposed in this pull request?
      
      Fix minor typos in Python example code in the streaming programming guide.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dmitriy Sokolov <silentsokolov@gmail.com>
      
      Closes #14805 from silentsokolov/fix-typos.
      d4eee993
    • Sean Owen's avatar
      [SPARK-17264][SQL] DataStreamWriter should document that it only supports Parquet for now · befab9c1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Clarify that only Parquet files are supported by DataStreamWriter for now.
      
      ## How was this patch tested?
      
      (Doc build -- no functional changes to test)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14860 from srowen/SPARK-17264.
      befab9c1
    • Xin Ren's avatar
      [SPARK-17276][CORE][TEST] Stop env params output on Jenkins job page · 2d76cb11
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-17276
      
      ## What changes were proposed in this pull request?
      
      When trying to find the error message in a failed Jenkins build job, I'm annoyed by the huge env output.
      The env parameter output should be muted.
      
      ![screen shot 2016-08-26 at 10 52 07 pm](https://cloud.githubusercontent.com/assets/3925641/18025581/b8d567ba-6be2-11e6-9eeb-6aec223f1730.png)
      
      ## How was this patch tested?
      
      Tested manually on local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14848 from keypointt/SPARK-17276.
      2d76cb11
    • gatorsmile's avatar
      [SPARK-17234][SQL] Table Existence Checking when Index Table with the Same Name Exists · bca79c82
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Hive Index tables are not supported by Spark SQL. Thus, we issue an exception when users try to access Hive Index tables. When the internal function `tableExists` tries to access Hive Index tables, it always gets the same error message: ```Hive index table is not supported```. This message could be confusing to users, since their SQL operations could be completely unrelated to Hive Index tables. For example, when users try to alter a table to a new name and there exists an index table with the same name, the expected exception should be a `TableAlreadyExistsException`.
      
      This PR made the following changes:
      - Introduced a new `AnalysisException` type: `SQLFeatureNotSupportedException`. When users try to access an `Index Table`, we will issue a `SQLFeatureNotSupportedException`.
      - `tableExists` returns `true` when hitting a `SQLFeatureNotSupportedException` and the feature is `Hive index table`.
      - Add a check, `requireTableNotExists`, to `SessionCatalog`'s `createTable` API; otherwise, the current implementation relies on Hive's internal checking.
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14801 from gatorsmile/tableExists.
      bca79c82
    • Takeshi YAMAMURO's avatar
      [SPARK-17289][SQL] Fix a bug to satisfy sort requirements in partial aggregations · 94922d79
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      Partial aggregations are generated in `EnsureRequirements`, but the planner fails to
      check if partial aggregation satisfies sort requirements.
      For the following query:
      ```
      val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")
      spark.sql("select max(b) from t2 group by a").explain(true)
      ```
      Currently, a Sort operator won't be inserted before the partial SortAggregate, and this breaks sort-based partial aggregation.
      ```
      == Physical Plan ==
      SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
      +- *Sort [a#5 ASC], false, 0
         +- Exchange hashpartitioning(a#5, 200)
            +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
               +- LocalTableScan [a#5, b#6]
      ```
      Actually, a correct plan is:
      ```
      == Physical Plan ==
      SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
      +- *Sort [a#5 ASC], false, 0
         +- Exchange hashpartitioning(a#5, 200)
            +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
               +- *Sort [a#5 ASC], false, 0
                  +- LocalTableScan [a#5, b#6]
      ```
      
      ## How was this patch tested?
      Added tests in `PlannerSuite`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #14865 from maropu/SPARK-17289.
      94922d79
    • frreiss's avatar
      [SPARK-17303] Added spark-warehouse to dev/.rat-excludes · 8fb445d9
      frreiss authored
      ## What changes were proposed in this pull request?
      
      Excludes the `spark-warehouse` directory from the Apache RAT checks that dev/run-tests performs. `spark-warehouse` is created by some of the Spark SQL tests, as well as by `bin/spark-sql`.
      
      ## How was this patch tested?
      
      Ran dev/run-tests twice. The second time, the script failed because the first iteration had created the `spark-warehouse` directory, which the RAT check flagged.
      Made the change in this PR.
      Ran dev/run-tests a third time; RAT checks succeeded.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #14870 from frreiss/fred-17303.
      8fb445d9
  4. Aug 29, 2016
    • Josh Rosen's avatar
      [SPARK-17301][SQL] Remove unused classTag field from AtomicType base class · 48b459dd
      Josh Rosen authored
      There's an unused `classTag` val in the AtomicType base class which is causing unnecessary slowness in deserialization because it needs to grab ScalaReflectionLock and create a new runtime reflection mirror. Removing this unused code gives a small but measurable performance boost in SQL task deserialization.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14869 from JoshRosen/remove-unused-classtag.
      48b459dd
    • Shivaram Venkataraman's avatar
      [SPARK-16581][SPARKR] Make JVM backend calling functions public · 736a7911
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
      
      ## How was this patch tested?
      
      Unit tests, CRAN checks
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14775 from shivaram/sparkr-java-api.
      736a7911
    • Davies Liu's avatar
      [SPARK-17063] [SQL] Improve performance of MSCK REPAIR TABLE with Hive metastore · 48caec25
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR splits the single `createPartitions()` call into smaller batches, which could prevent the Hive metastore from OOMing (caused by millions of partitions).
      
      It will also try to gather all the fast stats (number of files and total size of all files) in parallel, to avoid the bottleneck of listing the files in the metastore sequentially; this is controlled by `spark.sql.gatherFastStats` (enabled by default).
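
      A minimal, self-contained sketch of the batching idea (the helper and the printed call are hypothetical, not the Spark code): rather than one huge call, the partition specs go to the metastore in fixed-size chunks.

      ```scala
      // Send partition specs to the metastore in batches instead of all at once.
      def addPartitionsInBatches[T](specs: Seq[T], batchSize: Int)(createBatch: Seq[T] => Unit): Unit =
        specs.grouped(batchSize).foreach(createBatch)

      // Example: 10,000 specs sent 100 at a time.
      addPartitionsInBatches((1 to 10000).map(i => s"ds=$i"), 100) { batch =>
        println(s"createPartitions(<${batch.size} partitions>)")  // stand-in for the metastore call
      }
      ```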
      
      ## How was this patch tested?
      
      Tested locally with 10000 partitions and 100 files with an embedded metastore. Without gathering fast stats in parallel, adding partitions took 153 seconds; after enabling it, gathering the fast stats took about 34 seconds and adding the partitions took 25 seconds (most of the time spent in the object store), 59 seconds in total (2.5X faster; with a larger cluster, gathering will be much faster).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14607 from davies/repair_batch.
      48caec25
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix LDA doc · 6a0fda2c
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR tries to fix the name of the `SparkDataFrame` used in the example. Also, it gives a reference URL for an example data file that users can play with.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14853 from junyangq/SPARKR-FixLDADoc.
      6a0fda2c
    • Seigneurin, Alexis (CONT)'s avatar
      fixed a typo · 08913ce0
      Seigneurin, Alexis (CONT) authored
      idempotant -> idempotent
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14833 from aseigneurin/fix-typo.
      08913ce0