  1. Oct 24, 2015
    • Jeffrey Naisbitt's avatar
      [SPARK-11264] bin/spark-class can't find assembly jars with certain GREP_OPTIONS set · 28132ceb
      Jeffrey Naisbitt authored
      Temporarily remove GREP_OPTIONS if set in bin/spark-class.
      
      Some GREP_OPTIONS will modify the output of the grep commands that are looking for the assembly jars.
      For example, if the -n option is specified, the grep output will look like:
      5:spark-assembly-1.5.1-hadoop2.4.0.jar
      
      This will not match the regular expressions, and so the jar files will not be found.  We could improve the regular expression to handle this case and trim off extra characters, but it is difficult to know which options may or may not be set.  Unsetting GREP_OPTIONS within the script handles all the cases and gives the desired output.
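      
      The mismatch can be reproduced outside the shell. The sketch below assumes a pattern anchored at the start of the file name; the pattern and the names `GrepOptionsSketch` and `isAssemblyJar` are illustrative assumptions, not copied from bin/spark-class.
      
      ```scala
      object GrepOptionsSketch {
        // Assumed pattern: anchored at the start of the file name.
        // Illustrative only; not copied from bin/spark-class.
        private val assemblyJar = "^spark-assembly.*hadoop.*\\.jar$".r

        /** True when the line looks like a bare assembly jar file name. */
        def isAssemblyJar(line: String): Boolean =
          assemblyJar.findFirstIn(line).isDefined
      }
      ```
      
      With such a pattern, the bare file name matches, while the same name prefixed by a `-n` style line number (`5:`) does not.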
      
      Author: Jeffrey Naisbitt <jnaisbitt@familysearch.org>
      
      Closes #9231 from naisbitt/unset-GREP_OPTIONS.
      28132ceb
    • dima's avatar
      [SPARK-11245] update twitter4j to 4.0.4 version · e5bc8c27
      dima authored
      update twitter4j to 4.0.4 version
      https://issues.apache.org/jira/browse/SPARK-11245
      
      Author: dima <pronix.service@gmail.com>
      
      Closes #9221 from pronix/twitter4j_update.
      e5bc8c27
    • Jeff Zhang's avatar
      [SPARK-11125] [SQL] Uninformative exception when running spark-sql witho… · ffed0049
      Jeff Zhang authored
      …ut building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
      
      This is the exception after this patch. Please help review.
      ```
      java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
      	at java.lang.ClassLoader.defineClass1(Native Method)
      	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
      	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
      	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	at java.lang.Class.forName0(Native Method)
      	at java.lang.Class.forName(Class.java:270)
      	at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	... 21 more
      Failed to load hive class.
      You need to build Spark with -Phive and -Phive-thriftserver.
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9134 from zjffdu/SPARK-11125.
      ffed0049
  2. Oct 23, 2015
  3. Oct 22, 2015
    • zsxwing's avatar
      [SPARK-11098][CORE] Add Outbox to cache the sending messages to resolve the message disorder issue · a88c66ca
      zsxwing authored
      The current NettyRpc has a message order issue because it uses a thread pool to send messages. E.g., running the following two lines in the same thread,
      
      ```
      ref.send("A")
      ref.send("B")
      ```
      
      The remote endpoint may see "B" before "A" because the two messages are sent in parallel.
      To resolve this issue, this PR adds an outbox for each connection. If a message is sent while the connection to the remote node is still being established, it is cached in the outbox, and the cached messages are sent one by one once the connection is established.
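      
      The outbox idea can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: the `Outbox` shape, the `deliver` callback, and `connectionEstablished` are hypothetical names chosen for the example.
      
      ```scala
      import scala.collection.mutable

      object OutboxSketch {
        /**
         * Hypothetical, simplified outbox for a single connection. `deliver`
         * stands in for the real network write; none of these names come from
         * the Spark code.
         */
        final class Outbox(deliver: String => Unit) {
          private val pending = mutable.Queue.empty[String]
          private var connected = false

          /** Queue the message, then flush if the connection is already up. */
          def send(message: String): Unit = synchronized {
            pending.enqueue(message)
            if (connected) drain()
          }

          /** Called once the connection is established: flush in arrival order. */
          def connectionEstablished(): Unit = synchronized {
            connected = true
            drain()
          }

          // Callers hold the lock, so messages leave strictly one by one.
          private def drain(): Unit =
            while (pending.nonEmpty) deliver(pending.dequeue())
        }
      }
      ```
      
      Because every message goes through the same queue under one lock, messages sent from the same thread are delivered in the order they were sent, whether they arrive before or after the connection is ready.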
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #9197 from zsxwing/rpc-outbox.
      a88c66ca
    • Andrew Or's avatar
      [SPARK-11251] Fix page size calculation in local mode · 34e71c6d
      Andrew Or authored
      ```
      // My machine only has 8 cores
      $ bin/spark-shell --master local[32]
      scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
      scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()
      
      Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
      ```
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9209 from andrewor14/fix-local-page-size.
      34e71c6d
    • Gábor Lipták's avatar
      [SPARK-7021] Add JUnit output for Python unit tests · 163d53e8
      Gábor Lipták authored
      WIP
      
      Author: Gábor Lipták <gliptak@gmail.com>
      
      Closes #8323 from gliptak/SPARK-7021.
      163d53e8
    • Michael Armbrust's avatar
      [SPARK-11116][SQL] First Draft of Dataset API · 53e83a3a
      Michael Armbrust authored
      *This PR adds a new experimental API to Spark, tentatively named Datasets.*
      
      A `Dataset` is a strongly-typed collection of objects that can be transformed in parallel using functional or relational operations.  Example usage is as follows:
      
      ### Functional
      ```scala
      > val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
      > ds.filter(_ % 1 == 0).collect()
      res1: Array[Int] = Array(1, 2, 3)
      ```
      
      ### Relational
      ```scala
      scala> ds.toDF().show()
      +-----+
      |value|
      +-----+
      |    1|
      |    2|
      |    3|
      +-----+
      
      > ds.select(expr("value + 1").as[Int]).collect()
      res11: Array[Int] = Array(2, 3, 4)
      ```
      
      ## Comparison to RDDs
       A `Dataset` differs from an `RDD` in the following ways:
        - The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be
          used to serialize the object into a binary format.  Encoders are also capable of mapping the
          schema of a given object to the Spark SQL type system.  In contrast, RDDs rely on runtime
          reflection based serialization.
        - Internally, a `Dataset` is represented by a Catalyst logical plan and the data is stored
          in the encoded form.  This representation allows for additional logical operations and
          enables many operations (sorting, shuffling, etc.) to be performed without deserializing to
          an object.
      
      A `Dataset` can be converted to an `RDD` by calling the `.rdd` method.
      
      ## Comparison to DataFrames
      
      A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific
      JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into
      a specific Dataset by calling `df.as[ElementType]`.  Similarly, you can transform a strongly-typed
      `Dataset` to a generic DataFrame by calling `ds.toDF()`.
      
      ## Implementation Status and TODOs
      
      This is a rough cut at the least controversial parts of the API.  The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API.  The following is being deferred to future PRs:
       - Joins and Aggregations (prototype here https://github.com/apache/spark/commit/f11f91e6f08c8cf389b8388b626cd29eec32d937)
       - Support for Java
      
      Additionally, the responsibility for binding an encoder to a given schema is currently done in a fairly ad-hoc fashion.  This is an internal detail, and what we are doing today works for the cases we care about.  However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding as we do in DataFrames).
      
      ## COMPATIBILITY NOTE
      Long term we plan to make `DataFrame` extend `Dataset[Row]`.  However,
      making this change to the class hierarchy would break the function signatures for the existing
      function operations (map, flatMap, etc).  As such, this class should be considered a preview
      of the final API.  Changes will be made to the interface after Spark 1.6.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9190 from marmbrus/dataset-infra.
      53e83a3a
    • guoxi's avatar
      [SPARK-11242][SQL] In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly · 188ea348
      guoxi authored
      Minor fix on the comment
      
      Author: guoxi <guoxi@us.ibm.com>
      
      Closes #9201 from xguo27/SPARK-11242.
      188ea348
    • Cheng Hao's avatar
      [SPARK-9735][SQL] Respect the user specified schema than the infer partition... · d4950e6b
      Cheng Hao authored
      [SPARK-9735][SQL] Respect the user specified schema than the infer partition schema for HadoopFsRelation
      
      This enables the unit test `hadoopFsRelationSuite.Partition column type casting`. The test previously threw an exception like the one below, because the automatically inferred partition schema was given higher priority than the user-specified one.
      
      ```
      java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
      	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
      	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
      	at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
      	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
      	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
      	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
      	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:88)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 in stage 3.0 (TID 206)
      java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
      	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
      	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
      	at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
      	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
      	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
      	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
      	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:88)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #8026 from chenghao-intel/partition_discovery.
      d4950e6b
    • Kay Ousterhout's avatar
      [SPARK-11163] Remove unnecessary addPendingTask calls. · 3535b91d
      Kay Ousterhout authored
      This commit removes unnecessary calls to addPendingTask in
      TaskSetManager.executorLost. These calls are unnecessary: for
      tasks that are still pending and haven't been launched, they're
      still in all of the correct pending lists, so calling addPendingTask
      has no effect. For tasks that are currently running (which may still be
      in the pending lists, depending on how they were scheduled), we call
      addPendingTask in handleFailedTask, so the calls at the beginning
      of executorLost are redundant.
      
      I think these calls are left over from when we re-computed the locality
      levels in addPendingTask; now that we call recomputeLocality separately,
      I don't think these are necessary.
      
      Now that those calls are removed, the readding parameter in addPendingTask
      is no longer necessary, so this commit also removes that parameter.
      
      markhamstra can you take a look at this?
      
      cc vanzin
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #9154 from kayousterhout/SPARK-11163.
      3535b91d
    • zsxwing's avatar
      [SPARK-11232][CORE] Use 'offer' instead of 'put' to make sure calling send won't be interrupted · 7bb6d31c
      zsxwing authored
      The current `NettyRpcEndpointRef.send` can be interrupted because it uses `LinkedBlockingQueue.put`, which may hang the application.
      
      Imagine the following execution order:
      
        | thread 1: TaskRunner.kill | thread 2: TaskRunner.run
      ------------- | ------------- | -------------
      1 | killed = true |
      2 |  | if (killed) {
      3 |  | throw new TaskKilledException
      4 |  | case _: TaskKilledException \| _: InterruptedException if task.killed =>
      5 | task.kill(interruptThread): interruptThread is true |
      6 | | execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled))
      7 | | localEndpoint.send(StatusUpdate(taskId, state, serializedData)): in LocalBackend
      
      Then `localEndpoint.send(StatusUpdate(taskId, state, serializedData))` will throw `InterruptedException`. This will prevent the executor from updating the task status and hang the application.
      
      A failure caused by the above issue can be seen here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44062/consoleFull
      
      Since `receivers` is an unbounded `LinkedBlockingQueue`, we can just use `LinkedBlockingQueue.offer` to resolve this issue.
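      
      The difference between the two calls can be seen in isolation. The sketch below is not the Spark code; `OfferVsPutSketch` and its helper names are hypothetical. It only relies on the documented behaviour of `java.util.concurrent.LinkedBlockingQueue`: `put` acquires its lock interruptibly, while `offer` on a queue with spare capacity does not block.
      
      ```scala
      import java.util.concurrent.LinkedBlockingQueue

      object OfferVsPutSketch {
        /** Runs `body` with the current thread's interrupt flag set, then clears it. */
        private def whileInterrupted[A](body: => A): A = {
          Thread.currentThread().interrupt()
          try body finally Thread.interrupted() // clear the flag for later code
        }

        /** True if `put` threw InterruptedException on an interrupted thread. */
        def putIsInterrupted(): Boolean = whileInterrupted {
          val queue = new LinkedBlockingQueue[String]()
          try { queue.put("status"); false }
          catch { case _: InterruptedException => true }
        }

        /** True if `offer` still enqueued the element on an interrupted thread. */
        def offerSucceeds(): Boolean = whileInterrupted {
          val queue = new LinkedBlockingQueue[String]()
          queue.offer("status") && queue.size == 1
        }
      }
      ```
      
      On an unbounded queue `offer` never fails for lack of capacity, so switching to it loses nothing and avoids the interruption.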
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #9198 from zsxwing/dont-interrupt-send.
      7bb6d31c
    • Wenchen Fan's avatar
      [SPARK-11216][SQL][FOLLOW-UP] add encoder/decoder for external row · 42d225f4
      Wenchen Fan authored
      address comments in https://github.com/apache/spark/pull/9184
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9212 from cloud-fan/encoder.
      42d225f4
    • Josh Rosen's avatar
      [SPARK-10708] Consolidate sort shuffle implementations · f6d06adf
      Josh Rosen authored
      There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
      f6d06adf
    • Forest Fang's avatar
      [SPARK-11244][SPARKR] sparkR.stop() should remove SQLContext · 94e2064f
      Forest Fang authored
      SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise even when SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains:
      
      ```r
      sc <- sparkR.init("local")
      sqlContext <- sparkRSQL.init(sc)
      sparkR.stop()
      sc <- sparkR.init("local")
      sqlContext <- sparkRSQL.init(sc)
      sqlContext
      ```
      producing
      ```r
      Error in callJMethod(x, "getClass") :
        Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
      ```
      
      I have added the check and removal only when SparkContext itself is initialized. I have also added a corresponding test for this fix. Let me know if you want me to move the test to the SQL test suite instead.
      
      p.s. I tried lint-r but ended up with lots of errors on existing code.
      
      Author: Forest Fang <forest.fang@outlook.com>
      
      Closes #9205 from saurfang/sparkR.stop.
      94e2064f
    • zhichao.li's avatar
      [SPARK-11121][CORE] Correct the TaskLocation type · c03b6d11
      zhichao.li authored
      Correct the logic to return an `HDFSCacheTaskLocation` instance when the input `str` is an in-memory location.
      
      Author: zhichao.li <zhichao.li@intel.com>
      
      Closes #9096 from zhichao-li/uselessBranch.
      c03b6d11
  4. Oct 21, 2015
    • Davies Liu's avatar
      [SPARK-11243][SQL] output UnsafeRow from columnar cache · 1d973327
      Davies Liu authored
      This PR changes InMemoryTableScan to output UnsafeRow, and optimizes the unrolling and scanning by copying the bytes for var-length types between UnsafeRow and ByteBuffer directly, without creating wrapper objects. When scanning the decimals in the TPC-DS store_sales table, it is 80% faster (each decimal is copied as a long without creating Decimal objects).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9203 from davies/unsafe_cache.
      1d973327
    • Yanbo Liang's avatar
      [SPARK-9392][SQL] Dataframe drop should work on unresolved columns · 40a10d76
      Yanbo Liang authored
      Dataframe drop should work on unresolved columns
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8821 from yanboliang/spark-9392.
      40a10d76
    • Reynold Xin's avatar
      Minor cleanup of ShuffleMapStage.outputLocs code. · 555b2086
      Reynold Xin authored
      I was looking at this code and found the documentation to be insufficient. I added more documentation and slightly refactored some relevant code paths to improve encapsulation. There is more that I want to do, but I want to get these changes in first.
      
      My goal is to reduce the direct exposure of internal fields in ShuffleMapStage to improve encapsulation. After this change, DAGScheduler no longer writes outputLocs directly. There are still 3 places that read outputLocs directly, but we can change those later.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9175 from rxin/stage-cleanup.
      555b2086
    • navis.ryu's avatar
      [SPARK-10151][SQL] Support invocation of hive macro · f481090a
      navis.ryu authored
      A macro in Hive (a GenericUDFMacro) contains the real function inside it, but that function is not conveyed to tasks, resulting in a null-pointer exception.
      
      Author: navis.ryu <navis@apache.org>
      
      Closes #8354 from navis/SPARK-10151.
      f481090a
    • Dilip Biswal's avatar
      [SPARK-8654][SQL] Analysis exception when using NULL IN (...) : invalid cast · dce2f8c9
      Dilip Biswal authored
      In the analysis phase, while processing the rules for the IN predicate, we
      compare the in-list types to the LHS expression type and generate a
      cast operation if necessary. In the case of NULL [NOT] IN expr1, we end up
      generating a cast from the in-list types to NULL, like cast (1 as NULL), which
      is not a valid cast.
      
      The fix is to find a common type between the LHS and RHS expressions and cast
      all the expressions to the common type.
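      
      As a rough illustration of the common-type idea (not the analyzer rule from this patch), consider a tiny, hypothetical type lattice in which a NULL operand yields to any other type. All names below are made up for the example; Spark's real coercion rules cover many more types.
      
      ```scala
      object InCommonTypeSketch {
        // Hypothetical, tiny type lattice; Spark's real coercion rules are richer.
        sealed trait DataType
        case object NullType extends DataType
        case object IntType extends DataType
        case object DoubleType extends DataType

        /** Wider of two types: NullType yields to anything, Int widens to Double. */
        def wider(a: DataType, b: DataType): DataType = (a, b) match {
          case (NullType, other) => other
          case (other, NullType) => other
          case (IntType, IntType) => IntType
          case _ => DoubleType
        }

        /** Common type for `lhs IN (inList...)`; every operand is cast to it. */
        def commonType(lhs: DataType, inList: Seq[DataType]): DataType =
          inList.foldLeft(lhs)(wider)
      }
      ```
      
      Under this sketch, `NULL IN (1, 2)` resolves to the integer type for both sides, so no cast to NULL is ever generated.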
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #9036 from dilipbiswal/spark_8654_new.
      dce2f8c9
    • Shagun Sodhani's avatar
      [SPARK-11233][SQL] register cosh in function registry · 19ad1863
      Shagun Sodhani authored
      Author: Shagun Sodhani <sshagunsodhani@gmail.com>
      
      Closes #9199 from shagunsodhani/proposed-fix-#11233.
      19ad1863
    • Artem Aliev's avatar
      [SPARK-11208][SQL] Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config · a37cd870
      Artem Aliev authored
      The executionHive is assumed to be a standard metastore located in a temporary directory as a Derby database. However, hive.metastore.rawstore.impl was not filtered out, so any custom metastore implementation with other storage properties (not JDO) would persist those temporary functions. CassandraHiveMetaStore from DataStax Enterprise is one example.
      
      Author: Artem Aliev <artem.aliev@datastax.com>
      
      Closes #9178 from artem-aliev/SPARK-11208.
      a37cd870
    • Yin Huai's avatar
      [SPARK-9740][SPARK-9592][SPARK-9210][SQL] Change the default behavior of... · 3afe448d
      Yin Huai authored
      [SPARK-9740][SPARK-9592][SPARK-9210][SQL] Change the default behavior of First/Last to RESPECT NULLS.
      
      I am changing the default behavior of `First`/`Last` to respect null values (the SQL standard default behavior).
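      
      The two behaviours can be contrasted with a small sketch over an in-memory column, where `None` models SQL NULL. The object, the function names, and the `ignoreNulls` flag are assumptions made for illustration and are not the signatures of the Spark expressions.
      
      ```scala
      object FirstLastSketch {
        /** `first` over a column; respects nulls unless `ignoreNulls` is set. */
        def first[A](values: Seq[Option[A]], ignoreNulls: Boolean): Option[A] =
          if (ignoreNulls) values.flatten.headOption
          else values.headOption.flatten

        /** `last` over a column; respects nulls unless `ignoreNulls` is set. */
        def last[A](values: Seq[Option[A]], ignoreNulls: Boolean): Option[A] =
          if (ignoreNulls) values.flatten.lastOption
          else values.lastOption.flatten
      }
      ```
      
      When nulls are respected, a column that starts or ends with NULL returns NULL; when they are ignored, the nearest non-null value is returned instead.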
      
      https://issues.apache.org/jira/browse/SPARK-9740
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8113 from yhuai/firstLast.
      3afe448d
    • Davies Liu's avatar
      [SPARK-11197][SQL] run SQL on files directly · f8c6bec6
      Davies Liu authored
      This PR introduces a new feature to run SQL directly on files without creating a table, for example:
      
      ```
      select id from json.`path/to/json/files` as j
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9173 from davies/source.
      f8c6bec6
    • Wenchen Fan's avatar
      [SPARK-10743][SQL] keep the name of expression if possible when do cast · 7c74ebca
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8859 from cloud-fan/cast.
      7c74ebca
    • Dilip Biswal's avatar
      [SPARK-10534] [SQL] ORDER BY clause allows only columns that are present in... · 49ea0e9d
      Dilip Biswal authored
      [SPARK-10534] [SQL] ORDER BY clause allows only columns that are present in the select projection list
      
      Find out the missing attributes by recursively looking
      at the sort order expression and rest of the code
      takes care of projecting them out.
      
      Added description from cloud-fan
      
      I wanna explain a bit more about this bug.
      
      When we resolve sort ordering, we use a special method that only resolves UnresolvedAttributes and UnresolvedExtractValue. However, for something like Floor('a), even if 'a is resolved, the Floor expression may still be unresolved because of a data type mismatch (for example, 'a is of string type and Floor needs double type). It therefore can't pass this filter, and we can't push down the missing attribute 'a.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #9123 from dilipbiswal/SPARK-10534.
      49ea0e9d
    • Wenchen Fan's avatar
      [SPARK-11216] [SQL] add encoder/decoder for external row · ccf536f9
      Wenchen Fan authored
      Implement encode/decode for external row based on `ClassEncoder`.
      
      TODO:
      * code cleanup
      * ~~fix corner cases~~
      * refactor the encoder interface
      * improve test for product codegen, to cover more corner cases.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9184 from cloud-fan/encoder.
      ccf536f9
    • nitin goyal's avatar
      [SPARK-11179] [SQL] Push filters through aggregate · f62e3260
      nitin goyal authored
      Push conjunctive predicates through Aggregate operators when their references are a subset of the groupingExpressions.
      
      Query plan before optimisation :-
      Filter ((c#138L = 2) && (a#0 = 3))
       Aggregate [a#0], [a#0,count(b#1) AS c#138L]
        Project [a#0,b#1]
         LocalRelation [a#0,b#1,c#2]
      
      Query plan after optimisation :-
      Filter (c#138L = 2)
       Aggregate [a#0], [a#0,count(b#1) AS c#138L]
        Filter (a#0 = 3)
         Project [a#0,b#1]
          LocalRelation [a#0,b#1,c#2]
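      
      The split between pushed and retained conjuncts in the plans above can be sketched as follows. This is a simplified model, not the optimizer rule itself: `Predicate` is a hypothetical stand-in that only records which columns a conjunct references.
      
      ```scala
      object PushFilterSketch {
        // Hypothetical stand-in for a Catalyst predicate: it only records which
        // columns the predicate references.
        final case class Predicate(sql: String, references: Set[String])

        /**
         * Splits conjuncts into (pushed below the aggregate, kept above it).
         * A conjunct may move down only if it references grouping columns alone.
         */
        def split(
            conjuncts: Seq[Predicate],
            grouping: Set[String]): (Seq[Predicate], Seq[Predicate]) =
          conjuncts.partition(_.references.subsetOf(grouping))
      }
      ```
      
      For the example plan, `a = 3` references only the grouping column `a` and moves below the Aggregate, while `c = 2` depends on the aggregate output and stays above it.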
      
      Author: nitin goyal <nitin.goyal@guavus.com>
      Author: nitin.goyal <nitin.goyal@guavus.com>
      
      Closes #9167 from nitin2goyal/master.
      f62e3260