  1. Nov 05, 2015
  2. Nov 04, 2015
    • [SPARK-11425] [SPARK-11486] Improve hybrid aggregation · 81498dd5
      Davies Liu authored
      After aggregation, the dataset can be smaller than the input, so it is better to do hash-based aggregation over all inputs first and then use sort-based aggregation to merge the partial results.
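
      A minimal, illustrative Python sketch of this strategy (not Spark's actual TungstenAggregationIterator; the helper name and spill threshold are made up for illustration):

      ```
      # Illustrative sketch only: hash-aggregate everything first, "spill" a sorted
      # run whenever the table gets too big, then merge the sorted runs with
      # sort-based aggregation.
      import heapq

      def hybrid_sum_by_key(rows, max_keys_in_memory=1 << 16):
          runs, table = [], {}
          for key, value in rows:
              table[key] = table.get(key, 0) + value        # hash-based partial aggregation
              if len(table) >= max_keys_in_memory:
                  runs.append(sorted(table.items()))        # spill a sorted run
                  table = {}
          runs.append(sorted(table.items()))
          merged = []                                       # sort-based merge of the runs
          for key, value in heapq.merge(*runs):
              if merged and merged[-1][0] == key:
                  merged[-1][1] += value
              else:
                  merged.append([key, value])
          return merged
      ```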
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9383 from davies/fix_switch.
    • [SPARK-11307] Reduce memory consumption of OutputCommitCoordinator · d0b56339
      Josh Rosen authored
      OutputCommitCoordinator uses a map in a place where an array would suffice, increasing its memory consumption for result stages with millions of tasks.
      
      This patch replaces that map with an array. The only tricky part of this is reasoning about the range of possible array indexes in order to make sure that we never index out of bounds.
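
      A rough Python illustration of the data-structure change (hypothetical names; Spark's actual code is Scala): instead of a per-stage map keyed by partition id, keep a flat array whose valid indexes are exactly the partition ids 0 .. numPartitions - 1:

      ```
      # Hypothetical sketch of the map -> array change, not Spark's actual code.
      NO_AUTHORIZED_COMMITTER = -1

      class StageState(object):
          def __init__(self, num_partitions):
              # Partition ids are always in [0, num_partitions), so a fixed-size
              # array indexed by partition id can replace the hash map.
              self.authorized_committers = [NO_AUTHORIZED_COMMITTER] * num_partitions

          def can_commit(self, partition, attempt_number):
              current = self.authorized_committers[partition]
              if current == NO_AUTHORIZED_COMMITTER:
                  self.authorized_committers[partition] = attempt_number
                  return True
              return current == attempt_number
      ```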
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9274 from JoshRosen/SPARK-11307.
    • [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and... · a752ddad
      Zhenhua Wang authored
      [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql
      
      1. def dialectClassName in HiveContext is unnecessary.
      In HiveContext, if conf.dialect == "hiveql", getSQLDialect() returns new HiveQLDialect(this);
      otherwise it uses super.getSQLDialect(). super.getSQLDialect() then calls dialectClassName, which is overridden in HiveContext and still returns super.dialectClassName.
      So we never reach the code "classOf[HiveQLDialect].getCanonicalName" in def dialectClassName in HiveContext.
      
      2. When we start bin/spark-sql, the default context is HiveContext, and the corresponding dialect is hiveql.
      However, if we type "set spark.sql.dialect;", the result is "sql", which is inconsistent with the actual dialect and is misleading. For example, we can run SQL such as "create table", which is only allowed in hiveql, yet this dialect conf reports "sql".
      Although this problem does not cause any execution error, it is misleading to Spark SQL users, so I think we should fix it.
      In this PR, when processing "set spark.sql.dialect" in SetCommand, I use "conf.dialect" instead of "getConf()" for the case key == SQLConf.DIALECT.key, so that it returns the right dialect conf.
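
      A quick way to check the fix from a Hive-enabled PySpark session (a sketch; the exact output format may differ):

      ```
      # Assuming a HiveContext-backed sqlContext (e.g. bin/pyspark built with Hive).
      # Before the fix this reported "sql" even though hiveql was in effect.
      sqlContext.sql("set spark.sql.dialect").show()   # expect: spark.sql.dialect -> hiveql
      sqlContext.setConf("spark.sql.dialect", "sql")
      sqlContext.sql("set spark.sql.dialect").show()   # expect: spark.sql.dialect -> sql
      ```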
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #9349 from wzhfy/dialect.
    • [SPARK-11491] Update build to use Scala 2.10.5 · ce5e6a28
      Josh Rosen authored
      Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.
    • [SPARK-11510][SQL] Remove SQL aggregation tests for higher order statistics · b6e0a5ae
      Reynold Xin authored
      We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9475 from rxin/SPARK-11510.
    • [SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpan · 411ff6af
      Yu ISHIKAWA authored
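
      The new Python API can be exercised roughly as follows (a sketch assuming a running SparkContext `sc`; the data is illustrative):

      ```
      from pyspark.mllib.fpm import PrefixSpan

      # Each sequence is a list of itemsets, each itemset a list of items.
      sequences = sc.parallelize([
          [["a", "b"], ["c"]],
          [["a"], ["c", "b"], ["a", "b"]],
          [["a", "b"], ["e"]],
          [["f"]],
      ], 2)
      model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=5)
      for fs in model.freqSequences().collect():
          print(fs.sequence, fs.freq)
      ```
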
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #9469 from yu-iskw/SPARK-10028.
    • [SPARK-11493] remove bitset from BytesToBytesMap · 1b6a5d4a
      Davies Liu authored
      Since we store 4 bytes for the number of records at the beginning of each page, an address can never be zero, so we do not need the bitset.

      Performance-wise, the bitset could speed up failed lookups when the slot is empty (the bitset is smaller than the longArray, so its cache hit rate is higher). In practice, though, the map is 35% - 70% full (assume 50% on average), so only half of the failed lookups can benefit from it; all the others pay the cost of loading the bitset while still having to access the longArray anyway.

      For aggregation, we always need to access the longArray (to insert a new key after a failed lookup); this was also confirmed by a benchmark.

      For broadcast hash join, there could be a regression, but a simple benchmark suggests there may not be one (most of the lookups are failed lookups):
      
      ```
      import time

      # Small build side, large probe side; time the join three times.
      sqlContext.range(1 << 20).write.parquet("small")
      df = sqlContext.read.parquet("small")
      for i in range(3):
          t = time.time()
          df2 = sqlContext.range(1 << 26).selectExpr("id * 1111111111 % 987654321 as id2")
          df2.join(df, df.id == df2.id2).count()
          print(time.time() - t)
      ```
      
      With the bitset (time in seconds):
      ```
      17.5404241085
      10.2758829594
      10.5786800385
      ```
      After removing the bitset (time in seconds):
      ```
      21.8939979076
      12.4132959843
      9.97224712372
      ```
      
      cc rxin nongli
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9452 from davies/remove_bitset.
    • [SPARK-10949] Update Snappy version to 1.1.2 · 701fb505
      Adam Roberts authored
      This is an updated version of #8995 by a-roberts. Original description follows:
      
      Snappy now supports concatenation of serialized streams; this patch bumps the version number, and the "does not support" test is now a "supports" test.
      
      Snappy 1.1.2 changelog mentions:
      
      > snappy-java-1.1.2 (22 September 2015)
      > This is a backward compatible release for 1.1.x.
      > Add AIX (32-bit) support.
      > There is no upgrade for the native libraries of the other platforms.
      
      > A major change since 1.1.1 is a support for reading concatenated results of SnappyOutputStream(s)
      > snappy-java-1.1.2-RC2 (18 May 2015)
      > Fix #107: SnappyOutputStream.close() is not idempotent
      > snappy-java-1.1.2-RC1 (13 May 2015)
      > SnappyInputStream now supports reading concatenated compressed results of SnappyOutputStream
      > There has been no compressed format change since 1.0.5.x, so you can read the compressed results interchangeably between these versions.
      > Fixes a problem when java.io.tmpdir does not exist.
      
      Closes #8995.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9439 from JoshRosen/update-snappy.
    • [SPARK-11505][SQL] Break aggregate functions into multiple files · d19f4fda
      Reynold Xin authored
      functions.scala was getting pretty long. I broke it into multiple files.
      
      I also added explicit data types for some public vals, and renamed the aggregate functions' pretty names to lower case, which is more consistent with the rest of the functions.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9471 from rxin/SPARK-11505.
    • [SPARK-11504][SQL] API audit for distributeBy and localSort · abf5e428
      Reynold Xin authored
      1. Renamed localSort -> sortWithinPartitions to avoid the ambiguity of "local".
      2. Renamed distributeBy -> repartition to match the existing repartition API (see the usage sketch below).
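
      A short PySpark sketch of the renamed methods (assuming the Spark 1.6 DataFrame API; column names are illustrative):

      ```
      df = sqlContext.range(1000).selectExpr("id % 10 as key", "id as value")
      # Hash-partition by "key" (formerly distributeBy), then sort rows within
      # each partition (formerly localSort) without a global sort.
      out = df.repartition(8, "key").sortWithinPartitions("key")
      out.explain()
      ```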
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9470 from rxin/SPARK-11504.
    • [SPARK-10304][SQL] Following up checking valid dir structure for partition discovery · de289bf2
      Liang-Chi Hsieh authored
      This patch follows up #8840.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9459 from viirya/detect_invalid_part_dir_following.
    • Closes #9464 · 987df4bf
      Reynold Xin authored
    • [SPARK-11490][SQL] variance should alias var_samp instead of var_pop. · 3bd6f5d2
      Reynold Xin authored
      stddev is an alias for stddev_samp. variance should be consistent with stddev.
      
      Also took the chance to remove internal Stddev and Variance, and only kept StddevSamp/StddevPop and VarianceSamp/VariancePop.
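
      A quick PySpark check of the new behavior (a sketch assuming the Spark 1.6 function names; the data is illustrative):

      ```
      from pyspark.sql import functions as F

      df = sqlContext.range(100).selectExpr("cast(id % 7 as double) as x")
      # variance should now match var_samp; var_pop differs by the (n-1)/n factor.
      df.agg(F.variance("x"), F.var_samp("x"), F.var_pop("x")).show()
      ```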
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9449 from rxin/SPARK-11490.
    • [SPARK-11197][SQL] add doc for run SQL on files directly · e0fc9c7e
      Wenchen Fan authored
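
      The documented feature, in short (a sketch; the path is illustrative):

      ```
      # Query a file directly by qualifying the "table" name with the data source.
      sqlContext.range(10).write.parquet("/tmp/sql_on_files_example")
      sqlContext.sql("SELECT * FROM parquet.`/tmp/sql_on_files_example`").show()
      ```
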
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9467 from cloud-fan/doc.
    • [SPARK-11485][SQL] Make DataFrameHolder and DatasetHolder public. · cd1df662
      Reynold Xin authored
      These two classes should be public, since they are used in public code.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9445 from rxin/SPARK-11485.
    • [SPARK-11235][NETWORK] Add ability to stream data using network lib. · 27feafcc
      Marcelo Vanzin authored
      The current interface used to fetch shuffle data is not very efficient for
      large buffers; it requires the receiver to buffer the entirety of the
      contents being downloaded in memory before processing the data.
      
      To use the network library to transfer large files (such as those that
      can be added using SparkContext addJar / addFile), this change adds a
      more efficient way of downloading data, by streaming the data and feeding
      it to a callback as data arrives.
      
      This is achieved by a custom frame decoder that replaces the current netty
      one; this decoder allows entering a mode where framing is skipped and data
      is instead provided directly to a callback. The existing netty classes
      (ByteToMessageDecoder and LengthFieldBasedFrameDecoder) could not be reused
      since their semantics do not allow for the interception approach the new
      decoder uses.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9206 from vanzin/SPARK-11235.
    • [SPARK-10622][CORE][YARN] Differentiate dead from "mostly dead" executors. · 8790ee6d
      Marcelo Vanzin authored
      In YARN mode, when preemption is enabled, we may leave executors in a
      zombie state while we wait to retrieve the reason for which the executor
      exited. This is so that we don't account for failed tasks that were
      running on a preempted executor.
      
      The issue is that while we wait for this information, the scheduler
      might decide to schedule tasks on the executor, which will never be
      able to run them. Other side effects include the block manager still
      considering the executor available to cache blocks, for example.
      
      So, when we know that an executor went down but we don't know why,
      stop everything related to the executor, except its running tasks.
      Only when we know the reason for the exit (or give up waiting for
      it) do we update the running tasks.
      
      This is achieved by a new `disableExecutor()` method in the
      `Schedulable` interface. For managers that do not behave like this
      (i.e. every one but YARN), the existing `executorLost()` method
      will behave the same way it did before.
      
      On top of that change, there are a few minor changes that make debugging easier
      and fix some other minor issues:
      - The cluster-mode AM was printing a misleading log message every
        time an executor disconnected from the driver (because the akka
        actor system was shared between driver and AM).
      - Avoid sending unnecessary requests for an executor's exit reason
        when we already know it was explicitly disabled / killed. This
        avoids both multiple requests, and unnecessary requests that would
        just cause warning messages on the AM (in the explicit kill case).
      - Tone down a log message about the executor being lost when it
        exited normally (e.g. preemption).
      - Wake up the AM monitor thread when requests for executor loss
        reasons arrive too, so that we can more quickly remove executors
        from this zombie state.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8887 from vanzin/SPARK-10622.
    • [SPARK-11443] Reserve space lines · 9b214cea
      Xusen Yin authored
      The trim_codeblock(lines) function in include_example.rb removes some blank lines in the code.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #9400 from yinxusen/SPARK-11443.