  1. Oct 05, 2015
      [SPARK-10900] [STREAMING] Add output operation events to StreamingListener · be7c5ff1
      zsxwing authored
      Add output operation events to StreamingListener so as to implement the following UI features:
      
      1. Progress bar of a batch in the batch list.
      2. Be able to display an output operation's `description` and `duration` when there is no Spark job in a streaming job.
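      
      A rough sketch of how an application might consume these events; the callback and event names follow this commit's summary but the exact field shapes (e.g. `duration` as an `Option[Long]`) are assumptions:
      
      ```scala
      import org.apache.spark.streaming.scheduler._
      
      // Log each output operation's description and duration as the new
      // events arrive (API details assumed from this commit's summary).
      class OutputOpListener extends StreamingListener {
        override def onOutputOperationStarted(
            event: StreamingListenerOutputOperationStarted): Unit = {
          println(s"output op started: ${event.outputOperationInfo.description}")
        }
      
        override def onOutputOperationCompleted(
            event: StreamingListenerOutputOperationCompleted): Unit = {
          val info = event.outputOperationInfo
          println(s"output op '${info.description}' took " +
            s"${info.duration.getOrElse(-1L)} ms")
        }
      }
      
      // ssc.addStreamingListener(new OutputOpListener)  // ssc: a StreamingContext
      ```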
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8958 from zsxwing/output-operation-events.
      [SPARK-10934] [SQL] handle hashCode of unsafe array correctly · a609eb20
      Wenchen Fan authored
      `Murmur3_x86_32.hashUnsafeWords` only accepts word-aligned bytes, but unsafe array data is not guaranteed to be word-aligned.
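      
      A small sketch of the alignment constraint, using the `org.apache.spark.unsafe` utilities (the byte-granularity method in the last line is an assumption about how the fix works):
      
      ```scala
      import org.apache.spark.unsafe.Platform
      import org.apache.spark.unsafe.hash.Murmur3_x86_32
      
      val bytes = Array[Byte](1, 2, 3, 4, 5, 6, 7, 8, 9) // 9 bytes: not a whole number of words
      val seed = 42
      
      // hashUnsafeWords requires lengthInBytes to be a multiple of 8, so it
      // can only be applied to the word-aligned prefix here:
      Murmur3_x86_32.hashUnsafeWords(bytes, Platform.BYTE_ARRAY_OFFSET, 8, seed)
      
      // An unsafe array's byte length has no such guarantee, so hashing it
      // needs byte granularity (method name is an assumption about the fix):
      // Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
      ```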
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8987 from cloud-fan/hash.
      [SPARK-10585] [SQL] only copy data once when generate unsafe projection · c4871369
      Wenchen Fan authored
      This PR is a complete rewrite of GenerateUnsafeProjection, to accomplish the goal of copying data only once. The old GenerateUnsafeProjection code is still there to reduce review difficulty.
      
      Instead of creating unsafe conversion code for structs, arrays, and maps, we generate code that writes the content to the global row buffer.
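      
      For context, a sketch of the (Spark-internal) entry point whose generated code this PR rewrites:
      
      ```scala
      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
      import org.apache.spark.unsafe.types.UTF8String
      
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("name", StringType)))
      
      // GenerateUnsafeProjection produces the code behind this projection;
      // after this PR, field contents are written into the row buffer once
      // instead of being converted and then copied.
      val toUnsafe = UnsafeProjection.create(schema)
      val unsafeRow = toUnsafe(InternalRow(1L, UTF8String.fromString("spark")))
      ```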
      
      Author: Wenchen Fan <cloud0fan@163.com>
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8747 from cloud-fan/copy-once.
  2. Sep 30, 2015
      [SPARK-10807] [SPARKR] Added as.data.frame as a synonym for collect · f21e2da0
      Oscar D. Lara Yejas authored
      Created method as.data.frame as a synonym for collect().
      
      Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
      Author: olarayej <oscar.lara.yejas@us.ibm.com>
      Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
      
      Closes #8908 from olarayej/SPARK-10807.
      [SPARK-9617] [SQL] Implement json_tuple · 89ea0041
      Nathan Howell authored
      This is an implementation of Hive's `json_tuple` function using Jackson Streaming.
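      
      A usage sketch (`sqlContext` is assumed in scope; the `logs` table and `payload` column are hypothetical):
      
      ```scala
      // json_tuple extracts several top-level fields from a JSON string in a
      // single streaming pass, like Hive's UDTF of the same name.
      val extracted = sqlContext.sql(
        "SELECT json_tuple(payload, 'user', 'event') FROM logs")
      ```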
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #7946 from NathanHowell/SPARK-9617.
      [SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should return... · 03cca5dc
      Reynold Xin authored
      [SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8900 from rxin/SPARK-10770-1.
      [SPARK-10851] [SPARKR] Exception not failing R applications (in yarn cluster mode) · c7b29ae6
      Sun Rui authored
      The YARN backend doesn't like it when user code calls System.exit, since it cannot know the exit status and thus cannot set an appropriate final status for the application.
      
      This PR removes the use of System.exit to exit RRunner. Instead, when the R process running a SparkR script returns a non-zero exit code, RRunner throws SparkUserAppException, which is caught by the ApplicationMaster so that it knows the application failed. For other failures, it throws SparkException.
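      
      A simplified sketch of that control flow (`SparkUserAppException` is Spark-internal, and the script name is a placeholder):
      
      ```scala
      import org.apache.spark.SparkUserAppException
      
      // Instead of System.exit(returnCode), surface the R process's failure
      // as an exception the YARN ApplicationMaster can catch and report.
      val process = new ProcessBuilder("Rscript", "app.R").start()
      val returnCode = process.waitFor()
      if (returnCode != 0) {
        throw new SparkUserAppException(returnCode)
      }
      ```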
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8938 from sun-rui/SPARK-10851.
      [SPARK-9741] [SQL] Approximate Count Distinct using the new UDAF interface. · 16fd2a2f
      Herman van Hovell authored
      This PR implements a HyperLogLog based Approximate Count Distinct function using the new UDAF interface.
      
      The implementation is inspired by the ClearSpring HyperLogLog implementation and should produce the same results.
      
      There is still some documentation and testing left to do.
      
      cc yhuai
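      
      For reference, a usage sketch through the `approxCountDistinct` function in `org.apache.spark.sql.functions` (`df` and `user_id` are placeholders; whether this PR changes what backs that function is not stated here):
      
      ```scala
      import org.apache.spark.sql.functions.{approxCountDistinct, col}
      
      // Approximate distinct count; rsd is the target relative standard deviation.
      val counts = df.select(approxCountDistinct(col("user_id"), rsd = 0.05))
      ```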
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #8362 from hvanhovell/SPARK-9741.
      [SPARK-10736] [ML] Use 1 for all ratings if $(ratingCol) = "" · 2931e89f
      Yanbo Liang authored
      For some implicit-feedback datasets, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This happens when users set `ratingCol` to an empty string.
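      
      A sketch of the resulting usage (column names and the `training` DataFrame are placeholders):
      
      ```scala
      import org.apache.spark.ml.recommendation.ALS
      
      val als = new ALS()
        .setUserCol("user")
        .setItemCol("item")
        .setImplicitPrefs(true)
        .setRatingCol("")   // no rating column: every observed pair gets rating 1
      val model = als.fit(training)
      ```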
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8937 from yanboliang/spark-10736.
      [SPARK-10811] [SQL] Eliminates unnecessary byte array copying · 4d5a005b
      Cheng Lian authored
      When reading Parquet string and binary-backed decimal values, Parquet's `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying implementation of these `Binary` values is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses the underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them.
      
      This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that this trick doesn't cover binary-backed decimals with precision greater than 18.
      
      In my micro-benchmark, this brings a ~15% performance boost when scanning the TPC-DS `store_sales` table (scale factor 15).
      
      Another minor optimization in this PR: we now directly construct a Java `BigDecimal` in `Decimal.toJavaBigDecimal` without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain.
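      
      A sketch of the two access paths on Parquet's `Binary` (built from a string here just for illustration; whether `getBytes` copies depends on the concrete subclass, and for the slice-backed values described above it always does):
      
      ```scala
      import org.apache.parquet.io.api.Binary
      
      val binary = Binary.fromString("example")
      val copied = binary.getBytes             // defensive copy of the bytes
      val shared = binary.toByteBuffer.array() // heap-backed array, no copy
      ```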
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8907 from liancheng/spark-10811/eliminate-array-copying.
  3. Sep 28, 2015
      [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE · bf4199e2
      Sean Owen authored
      In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps on permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to each project's license in that other project's source tree.
      
      The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in a "licenses" subdirectory and point to that.
      
      Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way.
      
      The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8919 from srowen/SPARK-10833.
      [SPARK-10859] [SQL] fix stats of StringType in columnar cache · ea02e551
      Davies Liu authored
      A UTF8String may come from an UnsafeRow, in which case its underlying buffer is not copied, so we should clone it in order to hold it in the stats.
      
      cc yhuai
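      
      A sketch of the fix's idea (the helper name and ordinal parameter are made up for illustration):
      
      ```scala
      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.unsafe.types.UTF8String
      
      // A string read out of an UnsafeRow shares the row's buffer, which may
      // be reused; clone() copies the bytes so the value is safe to retain.
      def retainForStats(row: InternalRow, ordinal: Int): UTF8String =
        row.getUTF8String(ordinal).clone()
      ```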
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8929 from davies/pushdown_string.
      [SPARK-10395] [SQL] Simplifies CatalystReadSupport · 14978b78
      Cheng Lian authored
      Please refer to [SPARK-10395] [1] for details.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-10395
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8553 from liancheng/spark-10395/simplify-parquet-read-support.
      [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes · 353c30bd
      jerryshao authored
      This bug was introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092): `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead hits the problem described in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790).
      
      Also consolidate and simplify some similar code snippets to keep the semantics consistent.
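      
      A sketch of the corrected fallback under the standard dynamic-allocation keys (the actual helper lives in the YARN code paths):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // When initialExecutors is unset, start from minExecutors rather than 0.
      def initialExecutorCount(conf: SparkConf): Int = {
        val minExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
        conf.getInt("spark.dynamicAllocation.initialExecutors", minExecutors)
      }
      ```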
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8910 from jerryshao/SPARK-10790.
      [SPARK-10812] [YARN] Spark hadoop util support switching to yarn · d8d50ed3
      Holden Karau authored
      While this is likely not a huge issue for real production systems, it can be a problem for test suites that set up a SparkContext, tear it down, and then stand up another SparkContext with a different master (e.g. some tests in local mode and some in yarn mode). Discovered while working on spark-testing-base against Spark 1.4.1, but the logic that triggers it appears to be present in master as well (see the SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a separate JVM, but this can be heavyweight.
      
      ```
      [info] SampleMiniClusterTest:
      [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
      [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
      [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
      [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
      [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
      [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
      [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
      [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
      [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
      [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
      [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
      [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
      ```
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
      Fix two mistakes in programming-guide page · b5824993
      David Martin authored
      seperate -> separate
      sees -> see
      
      Author: David Martin <dmartinpro@users.noreply.github.com>
      
      Closes #8928 from dmartinpro/patch-1.