  1. Aug 31, 2016
    • Michael Gummelt's avatar
      [SPARK-17320] add build_profile_flags entry to mesos build module · 0611b3a2
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      add build_profile_flags entry to mesos build module
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14885 from mgummelt/mesos-profile.
      0611b3a2
    • hyukjinkwon's avatar
      [MINOR][SPARKR] Verbose build comment in WINDOWS.md rather than promoting... · 9953442a
      hyukjinkwon authored
      [MINOR][SPARKR] Verbose build comment in WINDOWS.md rather than promoting default build without Hive
      
      ## What changes were proposed in this pull request?
      
      This PR fixes `WINDOWS.md` to point readers at the other build profiles in http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn rather than directly telling them to run `mvn -DskipTests -Psparkr package`, which builds without Hive support.
      
      ## How was this patch tested?
      
      Manually,
      
      <img width="626" alt="2016-08-31 6 01 08" src="https://cloud.githubusercontent.com/assets/6477701/18122549/f6297b2c-6fa4-11e6-9b5e-fd4347355d87.png">
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14890 from HyukjinKwon/minor-build-r.
      9953442a
    • Wenchen Fan's avatar
      [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create AlterViewAsCommand to handle ALTER VIEW AS · 12fd0cd6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently we use `CreateViewCommand` to implement ALTER VIEW AS, which has 3 bugs:
      
      1. SPARK-17180: ALTER VIEW AS should alter temp view if view name has no database part and temp view exists
      2. SPARK-17309: ALTER VIEW AS should issue exception if view does not exist.
      3. SPARK-17323: ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.
      
      The root cause is that ALTER VIEW AS is quite different from CREATE VIEW, so we need a different code path to handle them. However, in `CreateViewCommand` there is no way to distinguish ALTER VIEW AS from CREATE VIEW without introducing an extra flag. Instead of doing that, I think a more natural way is to separate the ALTER VIEW AS logic into a new command.
      
      ## How was this patch tested?
      
      new tests in SQLViewSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14874 from cloud-fan/minor4.
      12fd0cd6
    • Jeff Zhang's avatar
      [SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf · fa634793
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Allow users to set the sparkr shell command through `--conf spark.r.shell.command`
      
      ## How was this patch tested?
      
      A unit test is added, and it was also verified manually through
      ```
      bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14744 from zjffdu/SPARK-17178.
      fa634793
  2. Aug 30, 2016
    • Kazuaki Ishizaki's avatar
      [SPARK-15985][SQL] Eliminate redundant cast from an array without null or a map without null · d92cd227
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR eliminates redundant cast from an `ArrayType` with `containsNull = false` or a `MapType` with `containsNull = false`.
      
      For example, in the `ArrayType` case, the current implementation leaves a cast `cast(value#63 as array<double>).toDoubleArray`. However, we can eliminate `cast(value#63 as array<double>)` if we know `value#63` does not contain `null`. This PR applies this elimination for `ArrayType` and `MapType` in `SimplifyCasts` during the plan optimization phase.
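
      A minimal sketch of the elimination idea (the rule name comes from this PR, but the body below is an assumption for illustration, not the actual Spark code):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Cast
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.catalyst.rules.Rule
      import org.apache.spark.sql.types.{ArrayType, MapType}

      object SimplifyCastsSketch extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
          // Casting a null-free array/map to the same element type that merely allows
          // nulls cannot fail and changes nothing, so the Cast node can be dropped.
          case c @ Cast(child, target) => (child.dataType, target) match {
            case (ArrayType(f, false), ArrayType(t, true)) if f == t => child
            case (MapType(fk, fv, false), MapType(tk, tv, true)) if fk == tk && fv == tv => child
            case _ => c
          }
        }
      }
      ```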
      
      In summary, we got 1.2-1.3x performance improvements over the code before applying this PR.
      Here are performance results of benchmark programs:
      ```
        test("Read array in Dataset") {
          import sparkSession.implicits._
      
          val iters = 5
          val n = 1024 * 1024
          val rows = 15
      
          val benchmark = new Benchmark("Read primitive array", n)
      
          val rand = new Random(511)
          val intDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
            .map(i => Array.tabulate(n)(i => i)).toDS()
          intDS.count() // force to create ds
          val lastElement = n - 1
          val randElement = rand.nextInt(lastElement)
      
          benchmark.addCase(s"Read int array in Dataset", numIters = iters)(iter => {
            val idx0 = randElement
            val idx1 = lastElement
            intDS.map(a => a(0) + a(idx0) + a(idx1)).collect
          })
      
          val doubleDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
            .map(i => Array.tabulate(n)(i => i.toDouble)).toDS()
          doubleDS.count() // force to create ds
      
          benchmark.addCase(s"Read double array in Dataset", numIters = iters)(iter => {
            val idx0 = randElement
            val idx1 = lastElement
            doubleDS.map(a => a(0) + a(idx0) + a(idx1)).collect
          })
      
          benchmark.run()
        }
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4
      Intel(R) Core(TM) i5-5257U CPU  2.70GHz
      
      without this PR
      Read primitive array:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Read int array in Dataset                      525 /  690          2.0         500.9       1.0X
      Read double array in Dataset                   947 / 1209          1.1         902.7       0.6X
      
      with this PR
      Read primitive array:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Read int array in Dataset                      400 /  492          2.6         381.5       1.0X
      Read double array in Dataset                   788 /  870          1.3         751.4       0.5X
      ```
      
      An example program that originally caused this performance issue.
      ```
      val ds = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDS()
      val ds2 = ds.map(p => {
           var s = 0.0
           for (i <- 0 to 2) { s += p(i) }
           s
         })
      ds2.show
      ds2.explain(true)
      ```
      
      Plans before this PR
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [input[0, double, true] AS value#68]
      +- 'MapElements <function1>, obj#67: double
         +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#66: [D
            +- LocalRelation [value#63]
      
      == Analyzed Logical Plan ==
      value: double
      SerializeFromObject [input[0, double, true] AS value#68]
      +- MapElements <function1>, obj#67: double
         +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalRelation [value#63]
      
      == Optimized Logical Plan ==
      SerializeFromObject [input[0, double, true] AS value#68]
      +- MapElements <function1>, obj#67: double
         +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalRelation [value#63]
      
      == Physical Plan ==
      *SerializeFromObject [input[0, double, true] AS value#68]
      +- *MapElements <function1>, obj#67: double
         +- *DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalTableScan [value#63]
      ```
      
      Plans after this PR
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [input[0, double, true] AS value#6]
      +- 'MapElements <function1>, obj#5: double
         +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#4: [D
            +- LocalRelation [value#1]
      
      == Analyzed Logical Plan ==
      value: double
      SerializeFromObject [input[0, double, true] AS value#6]
      +- MapElements <function1>, obj#5: double
         +- DeserializeToObject cast(value#1 as array<double>).toDoubleArray, obj#4: [D
            +- LocalRelation [value#1]
      
      == Optimized Logical Plan ==
      SerializeFromObject [input[0, double, true] AS value#6]
      +- MapElements <function1>, obj#5: double
         +- DeserializeToObject value#1.toDoubleArray, obj#4: [D
            +- LocalRelation [value#1]
      
      == Physical Plan ==
      *SerializeFromObject [input[0, double, true] AS value#6]
      +- *MapElements <function1>, obj#5: double
         +- *DeserializeToObject value#1.toDoubleArray, obj#4: [D
            +- LocalTableScan [value#1]
      ```
      
      ## How was this patch tested?
      
      Tested by new test cases in `SimplifyCastsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #13704 from kiszk/SPARK-15985.
      d92cd227
    • Shixiong Zhu's avatar
      [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl · 231f9732
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      There have been a lot of failures recently: http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl
      
      This PR just changed the persist level to `MEMORY_AND_DISK_2` to avoid blocks being evicted from memory.
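
      For reference, a minimal sketch of the storage-level change (an illustrative spark-shell snippet, not the actual test code in `ReplSuite`):

      ```scala
      import org.apache.spark.storage.StorageLevel

      // With MEMORY_ONLY_2, replicated blocks can simply be evicted under memory pressure;
      // MEMORY_AND_DISK_2 lets them spill to disk instead, so the replication check is stable.
      val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_2)
      rdd.count()
      ```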
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14884 from zsxwing/SPARK-17318.
      231f9732
    • Alex Bozarth's avatar
      [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history · f7beae6d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      With the new History Server, the summary page loads the application list via the REST API, which makes it very slow, or impossible, to load with a large (10K+) application history. This PR fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` API. (Note this only applies to what the summary page displays; all of the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)
      
      I've also added a new test for the `limit` param in `HistoryServerSuite.scala`
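
      For illustration, a hypothetical way to use the new conf and `limit` param (host and value are examples only):

      ```
      # In conf/spark-defaults.conf on the History Server:
      spark.history.ui.maxApplications 500

      # The summary page then fetches at most that many applications via the REST API, e.g.:
      # GET http://<history-server>:18080/api/v1/applications?limit=500
      ```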
      
      ## How was this patch tested?
      
      Manual testing and dev/run-tests
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #14835 from ajbozarth/spark17243.
      f7beae6d
    • Shixiong Zhu's avatar
      [SPARK-17314][CORE] Use Netty's DefaultThreadFactory to enable its fast ThreadLocal impl · 02ac379e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When a thread is a Netty FastThreadLocalThread, Netty will use its fast ThreadLocal implementation, which performs better than the JDK's (see the benchmark results in https://github.com/netty/netty/pull/4417; note that PR is not a fix to Netty's FastThreadLocal, it just fixed an issue in Netty's benchmark code).
      
      This PR just changed the ThreadFactory to Netty's DefaultThreadFactory which will use FastThreadLocalThread. There is also a minor change to the thread names. See https://github.com/netty/netty/blob/netty-4.0.22.Final/common/src/main/java/io/netty/util/concurrent/DefaultThreadFactory.java#L94
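
      A minimal sketch of the idea, assuming a plain Java executor (the actual patch touches Spark's transport/RPC thread pools; the pool name here is made up):

      ```scala
      import java.util.concurrent.Executors

      import io.netty.util.concurrent.DefaultThreadFactory

      // Threads built by DefaultThreadFactory are FastThreadLocalThreads, so code running on
      // them gets Netty's fast ThreadLocal implementation instead of the JDK's.
      val threadFactory = new DefaultThreadFactory("example-rpc-pool", /* daemon = */ true)
      val pool = Executors.newCachedThreadPool(threadFactory)
      ```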
      
      ## How was this patch tested?
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14879 from zsxwing/netty-thread.
      02ac379e
    • Josh Rosen's avatar
      [SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted · fb200843
      Josh Rosen authored
      This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run
      
      ```
      sc.parallelize(1 to 100000, 100000).map(identity).count()
      ```
      
      then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png)
      
      When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:
      
      ```scala
         for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
            do {
              launchedTask = resourceOfferSingleTaskSet(
                  taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
            } while (launchedTask)
          }
      ```
      
      In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable. The problem with this placement is that the last call to `resourceOfferSingleTaskSet` in the `while` loop always returns `false` (that is how the loop terminates), which means `resourceOfferSingleTaskSet` will call `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` triggers the `abortIfCompletelyBlacklisted` check for every task set.
      
      Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set in case none of the task set's tasks can be scheduled at any locality level.
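
      A rough sketch of the restructured loop (variable and helper names follow the snippet above and are illustrative, not the exact patch):

      ```scala
      for (taskSet <- sortedTaskSets) {
        var launchedAnyTask = false
        for (maxLocality <- taskSet.myLocalityLevels) {
          var launchedTask = false
          do {
            launchedTask = resourceOfferSingleTaskSet(
                taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
            launchedAnyTask = launchedAnyTask || launchedTask
          } while (launchedTask)
        }
        // At most one blacklist check per task set, and only if nothing could be
        // scheduled at any locality level.
        if (!launchedAnyTask) {
          taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
        }
      }
      ```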
      
      Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change.
      
      /cc squito and kayousterhout for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14871 from JoshRosen/bail-early-if-no-cpus.
      fb200843
    • Ferdinand Xu's avatar
      [SPARK-5682][CORE] Add encrypted shuffle in spark · 4b4e329e
      Ferdinand Xu authored
      This patch uses the Apache Commons Crypto library to enable shuffle encryption support.
      
      Author: Ferdinand Xu <cheng.a.xu@intel.com>
      Author: kellyzly <kellyzly@126.com>
      
      Closes #8880 from winningsix/SPARK-10771.
      4b4e329e
    • Xin Ren's avatar
      [MINOR][MLLIB][SQL] Clean up unused variables and unused import · 27209252
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Clean up unused variables and unused import statements, remove unnecessary `return` and `toArray` calls, and make some other style improvements that I noticed while walking through the code examples.
      
      ## How was this patch tested?
      
      Tested manually on a local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14836 from keypointt/codeWalkThroughML.
      27209252
    • Dmitriy Sokolov's avatar
      [MINOR][DOCS] Fix minor typos in python example code · d4eee993
      Dmitriy Sokolov authored
      ## What changes were proposed in this pull request?
      
      Fix minor typos in the Python example code in the streaming programming guide
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dmitriy Sokolov <silentsokolov@gmail.com>
      
      Closes #14805 from silentsokolov/fix-typos.
      d4eee993
    • Sean Owen's avatar
      [SPARK-17264][SQL] DataStreamWriter should document that it only supports Parquet for now · befab9c1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Clarify that only parquet files are supported by DataStreamWriter now
      
      ## How was this patch tested?
      
      (Doc build -- no functional changes to test)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14860 from srowen/SPARK-17264.
      befab9c1
    • Xin Ren's avatar
      [SPARK-17276][CORE][TEST] Stop env params output on Jenkins job page · 2d76cb11
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-17276
      
      ## What changes were proposed in this pull request?
      
      When trying to find the error message in a failed Jenkins build job, the huge dump of environment parameters gets in the way.
      The env parameter output should be muted.
      
      ![screen shot 2016-08-26 at 10 52 07 pm](https://cloud.githubusercontent.com/assets/3925641/18025581/b8d567ba-6be2-11e6-9eeb-6aec223f1730.png)
      
      ## How was this patch tested?
      
      Tested manually on local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14848 from keypointt/SPARK-17276.
      2d76cb11
    • gatorsmile's avatar
      [SPARK-17234][SQL] Table Existence Checking when Index Table with the Same Name Exists · bca79c82
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Hive Index tables are not supported by Spark SQL. Thus, we issue an exception when users try to access Hive Index tables. When the internal function `tableExists` tries to access Hive Index tables, it always gets the same error message: ```Hive index table is not supported```. This message could be confusing to users, since their SQL operations could be completely unrelated to Hive Index tables. For example, when users try to alter a table to a new name and there exists an index table with the same name, the expected exception should be a `TableAlreadyExistsException`.
      
      This PR made the following changes:
      - Introduced a new `AnalysisException` type: `SQLFeatureNotSupportedException`. When users try to access an `Index Table`, we will issue a `SQLFeatureNotSupportedException`.
      - `tableExists` returns `true` when it hits a `SQLFeatureNotSupportedException` whose feature is `Hive index table` (see the sketch below).
      - Added a `requireTableNotExists` check to `SessionCatalog`'s `createTable` API; without it, the current implementation relies on Hive's internal checking.
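
      A rough sketch of the `tableExists` behaviour from the second bullet (the helper name and exception handling below are assumptions for illustration, not the actual implementation):

      ```scala
      def tableExists(db: String, table: String): Boolean = {
        try {
          getRawTable(db, table)   // hypothetical helper that calls into the Hive client
          true
        } catch {
          // The "Hive index table is not supported" error still proves the name is taken,
          // so report the table as existing rather than surfacing the confusing message.
          case e: SQLFeatureNotSupportedException if e.getMessage.contains("Hive index table") =>
            true
          case _: NoSuchTableException =>
            false
        }
      }
      ```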
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14801 from gatorsmile/tableExists.
      bca79c82
    • Takeshi YAMAMURO's avatar
      [SPARK-17289][SQL] Fix a bug to satisfy sort requirements in partial aggregations · 94922d79
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      Partial aggregations are generated in `EnsureRequirements`, but the planner fails to
      check if partial aggregation satisfies sort requirements.
      For the following query:
      ```
      val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")
      spark.sql("select max(b) from t2 group by a").explain(true)
      ```
      Currently, no Sort operator is inserted before the partial aggregation, which breaks sort-based partial aggregation:
      ```
      == Physical Plan ==
      SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
      +- *Sort [a#5 ASC], false, 0
         +- Exchange hashpartitioning(a#5, 200)
            +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
               +- LocalTableScan [a#5, b#6]
      ```
      Actually, a correct plan is:
      ```
      == Physical Plan ==
      SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
      +- *Sort [a#5 ASC], false, 0
         +- Exchange hashpartitioning(a#5, 200)
            +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
               +- *Sort [a#5 ASC], false, 0
                  +- LocalTableScan [a#5, b#6]
      ```
      
      ## How was this patch tested?
      Added tests in `PlannerSuite`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #14865 from maropu/SPARK-17289.
      94922d79
    • frreiss's avatar
      [SPARK-17303] Added spark-warehouse to dev/.rat-excludes · 8fb445d9
      frreiss authored
      ## What changes were proposed in this pull request?
      
      Excludes the `spark-warehouse` directory from the Apache RAT checks that src/run-tests performs. `spark-warehouse` is created by some of the Spark SQL tests, as well as by `bin/spark-sql`.
      
      ## How was this patch tested?
      
      Ran src/run-tests twice. The second time, the script failed because the first run had created the `spark-warehouse` directory, which the RAT checks then flagged.
      Made the change in this PR.
      Ran src/run-tests a third time; RAT checks succeeded.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #14870 from frreiss/fred-17303.
      8fb445d9
  3. Aug 29, 2016
    • Josh Rosen's avatar
      [SPARK-17301][SQL] Remove unused classTag field from AtomicType base class · 48b459dd
      Josh Rosen authored
      There's an unused `classTag` val in the AtomicType base class which is causing unnecessary slowness in deserialization because it needs to grab ScalaReflectionLock and create a new runtime reflection mirror. Removing this unused code gives a small but measurable performance boost in SQL task deserialization.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14869 from JoshRosen/remove-unused-classtag.
      48b459dd
    • Shivaram Venkataraman's avatar
      [SPARK-16581][SPARKR] Make JVM backend calling functions public · 736a7911
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
      
      ## How was this patch tested?
      
      Unit tests, CRAN checks
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14775 from shivaram/sparkr-java-api.
      736a7911
    • Davies Liu's avatar
      [SPARK-17063] [SQL] Improve performance of MSCK REPAIR TABLE with Hive metastore · 48caec25
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR splits the single `createPartitions()` call into smaller batches, which can prevent the Hive metastore from running out of memory (caused by millions of partitions).
      
      It will also try to gather all the fast stats (number of files and total size of all files) in parallel, to avoid the bottleneck of listing the files in the metastore sequentially; this is controlled by `spark.sql.gatherFastStats` (enabled by default).
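
      A minimal sketch of the batching idea (batch size, client, and variable names are illustrative, not the actual code or defaults):

      ```scala
      // Add partitions in fixed-size batches instead of one huge createPartitions() call,
      // so a single metastore RPC never has to carry millions of partitions.
      val batchSize = 100
      partitionSpecs.grouped(batchSize).foreach { batch =>
        hiveClient.createPartitions(db, table, batch, ignoreIfExists = true)
      }
      ```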
      
      ## How was this patch tested?
      
      Tested locally with 10000 partitions and 100 files, using the embedded metastore. Without gathering fast stats in parallel, adding the partitions took 153 seconds. After enabling it, gathering the fast stats took about 34 seconds and adding the partitions took 25 seconds (most of the time spent in the object store), 59 seconds in total, which is 2.5X faster (on a larger cluster, gathering will be much faster).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14607 from davies/repair_batch.
      48caec25
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix LDA doc · 6a0fda2c
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the name of the `SparkDataFrame` used in the example. It also gives a reference URL for an example data file that users can play with.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14853 from junyangq/SPARKR-FixLDADoc.
      6a0fda2c
    • Seigneurin, Alexis (CONT)'s avatar
      fixed a typo · 08913ce0
      Seigneurin, Alexis (CONT) authored
      idempotant -> idempotent
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14833 from aseigneurin/fix-typo.
      08913ce0
    • Sean Owen's avatar
      [BUILD] Closes some stale PRs. · 1a48c004
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Closes #10995
      Closes #13658
      Closes #14505
      Closes #14536
      Closes #12753
      Closes #14449
      Closes #12694
      Closes #12695
      Closes #14810
      Closes #10572
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14849 from srowen/CloseStalePRs.
      1a48c004
  4. Aug 28, 2016
    • Tejas Patil's avatar
      [SPARK-17271][SQL] Planner adds un-necessary Sort even if child ordering is... · 095862a3
      Tejas Patil authored
      [SPARK-17271][SQL] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering
      
      ## What changes were proposed in this pull request?
      
      Jira : https://issues.apache.org/jira/browse/SPARK-17271
      
      The planner adds an unneeded SORT operation because of a bug in the way `SortOrder` is compared at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
      `SortOrder` needs to be compared semantically because the `Expression`s within two `SortOrder`s can be "semantically equal" without being literally equal objects.
      
      e.g. in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`
      
      Expression in required SortOrder:
      ```
            AttributeReference(
              name = "col1",
              dataType = LongType,
              nullable = false
            ) (exprId = exprId,
              qualifier = Some("a")
            )
      ```
      
      Expression in child SortOrder:
      ```
            AttributeReference(
              name = "col1",
              dataType = LongType,
              nullable = false
            ) (exprId = exprId)
      ```
      
      Notice that the required column has a qualifier while the child attribute does not, but the underlying expression is the same, and hence in this case we can say that the child satisfies the required sort order.
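
      A rough sketch of the semantic comparison described above (the method name comes from the PR; the body is an assumption):

      ```scala
      // Two SortOrders are interchangeable when they sort in the same direction and their
      // child expressions are semantically equal, even if the Expression objects differ
      // (e.g. one carries a qualifier and the other does not).
      def sortOrderSemanticEquals(required: SortOrder, child: SortOrder): Boolean =
        required.direction == child.direction &&
          required.child.semanticEquals(child.child)
      ```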
      
      This PR includes following changes:
      - Added a `semanticEquals` method to `SortOrder` so that it can compare underlying child expressions semantically (and not using default Object.equals)
      - Fixed `EnsureRequirements` to use semantic comparison of SortOrder
      
      ## How was this patch tested?
      
      - Added a test case to `PlannerSuite` and ran the rest of the tests in `PlannerSuite`
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #14841 from tejasapatil/SPARK-17271_sort_order_equals_bug.
      095862a3
  5. Aug 27, 2016
  6. Aug 26, 2016
    • Reynold Xin's avatar
      [SPARK-17270][SQL] Move object optimization rules into its own file · cc0caa69
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      As part of breaking Optimizer.scala apart, this patch moves various Dataset object optimization rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.
      
      ## How was this patch tested?
      This should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14839 from rxin/SPARK-17270.
      cc0caa69
    • Yin Huai's avatar
      [SPARK-17266][TEST] Add empty strings to the regressionTests of PrefixComparatorsSuite · a6bca3ad
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR adds a regression test to PrefixComparatorsSuite's "String prefix comparator" because this test failed on jenkins once (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/).
      
      I could not reproduce it locally, but let's add this test case to the regressionTests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14837 from yhuai/SPARK-17266.
      a6bca3ad
    • Sameer Agarwal's avatar
      [SPARK-17244] Catalyst should not pushdown non-deterministic join conditions · 540e9128
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.
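
      For illustration (the query is made up, not from the PR), a non-deterministic join condition that must not be pushed down:

      ```scala
      import org.apache.spark.sql.functions.rand

      // rand() is non-deterministic and stateful: evaluated per-row of one input before the
      // join, it sees a different row stream than when evaluated per joined row, so pushing
      // it below the join could change the result.
      val joined = orders.join(customers,
        orders("customer_id") === customers("id") && rand() > 0.5)
      ```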
      
      ## How was this patch tested?
      
      A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14815 from sameeragarwal/constraint-inputfile.
      540e9128
    • petermaxlee's avatar
      [SPARK-17235][SQL] Support purging of old logs in MetadataLog · f64a1ddd
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.
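
      A rough sketch of the purge interface (the `purge` name comes from the PR; the exact signatures are assumptions):

      ```scala
      trait MetadataLog[T] {
        def add(batchId: Long, metadata: T): Boolean
        def get(batchId: Long): Option[T]
        def getLatest(): Option[(Long, T)]
        /** Remove all log entries whose batch id is earlier than `thresholdBatchId`. */
        def purge(thresholdBatchId: Long): Unit
      }
      ```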
      
      ## How was this patch tested?
      Added a unit test case in HDFSMetadataLogSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14802 from petermaxlee/SPARK-17235.
      f64a1ddd
    • Herman van Hovell's avatar
      [SPARK-17246][SQL] Add BigDecimal literal · a11d10f1
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number then it will be interpreted as a `BigDecimal`; for example, `12.0E10BD` will be interpreted as a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.
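
      A small usage example, assuming the syntax described above (run in spark-shell, where `spark` is predefined):

      ```scala
      // 12.0E10BD parses to a BigDecimal with unscaled value 120 and scale -9,
      // i.e. exactly 120000000000.
      spark.sql("SELECT 12.0E10BD AS exact_big, 123.45BD AS exact_small").show()
      ```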
      
      ## How was this patch tested?
      Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #14819 from hvanhovell/SPARK-17246.
      a11d10f1
    • Michael Gummelt's avatar
      [SPARK-16967] move mesos to module · 8e5475be
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Move Mesos code into a mvn module
      
      ## How was this patch tested?
      
      unit tests
      manually submitting a client mode and cluster mode job
      spark/mesos integration test suite
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14637 from mgummelt/mesos-module.
      8e5475be
    • Peng, Meng's avatar
      [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtils · c0949dc9
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      
      Fix a bug when comparing Vectors in TestingUtils.
      The same bug exists for Matrix comparison; how to check the size of a Matrix should be discussed first.
      
      ## How was this patch tested?
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14785 from mpjlu/testUtils.
      c0949dc9
    • petermaxlee's avatar
      [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely · 9812f7d5
      petermaxlee authored
      ## What changes were proposed in this pull request?
      Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.
      
      This patch introduces a new user-defined option called "maxFileAge", defaulting to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.
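
      A hypothetical usage of the new option (source format, path, and the duration string are examples; the accepted value format is assumed):

      ```scala
      // Files older than maxFileAge are dropped from the in-memory "seen files" map,
      // so the map stays bounded even for streams that run for months.
      val lines = spark.readStream
        .format("text")
        .option("maxFileAge", "24h")
        .load("/data/incoming")
      ```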
      
      ## How was this patch tested?
      Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14728 from petermaxlee/SPARK-17165.
      9812f7d5