  1. Jul 20, 2015
    • Mateusz Buśkiewicz's avatar
      [SPARK-9101] [PySpark] Add missing NullType · 02181fb6
      Mateusz Buśkiewicz authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9101
      
      Author: Mateusz Buśkiewicz <mateusz.buskiewicz@getbase.com>
      
      Closes #7499 from sixers/spark-9101 and squashes the following commits:
      
      dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
      97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types
      02181fb6
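A minimal pure-Python sketch (all names hypothetical, not pyspark's actual code) of the kind of name-to-type lookup this fix repairs: `pyspark.sql.types` builds a dict from a list of atomic types, so if `NullType` is missing from that list, parsing a schema containing a null literal's type fails.

```python
# Hypothetical distillation of the _atomic_types lookup in pyspark.sql.types.
class DataType:
    @classmethod
    def type_name(cls):
        # "NullType" -> "null", mirroring typeName() naming
        return cls.__name__[:-4].lower()

class StringType(DataType): pass
class NullType(DataType): pass

# Before the fix, NullType was absent from this list.
_atomic_types = [StringType, NullType]
_all_atomic_types = {t.type_name(): t for t in _atomic_types}

def parse_atomic_type(name):
    try:
        return _all_atomic_types[name]
    except KeyError:
        raise ValueError("Could not parse datatype: %s" % name)

print(parse_atomic_type("null").__name__)
```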
    • Imran Rashid's avatar
      [SPARK-8103][core] DAGScheduler should not submit multiple concurrent attempts for a stage · 80e2568b
      Imran Rashid authored
      https://issues.apache.org/jira/browse/SPARK-8103
      
      cc kayousterhout (thanks for the extra test case)
      
      Author: Imran Rashid <irashid@cloudera.com>
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      Author: Imran Rashid <squito@users.noreply.github.com>
      
      Closes #6750 from squito/SPARK-8103 and squashes the following commits:
      
      fb3acfc [Imran Rashid] fix log msg
      e01b7aa [Imran Rashid] fix some comments, style
      584acd4 [Imran Rashid] simplify going from taskId to taskSetMgr
      e43ac25 [Imran Rashid] Merge branch 'master' into SPARK-8103
      6bc23af [Imran Rashid] update log msg
      4470fa1 [Imran Rashid] rename
      c04707e [Imran Rashid] style
      88b61cc [Imran Rashid] add tests to make sure that TaskSchedulerImpl schedules correctly with zombie attempts
      d7f1ef2 [Imran Rashid] get rid of activeTaskSets
      a21c8b5 [Imran Rashid] Merge branch 'master' into SPARK-8103
      906d626 [Imran Rashid] fix merge
      109900e [Imran Rashid] Merge branch 'master' into SPARK-8103
      c0d4d90 [Imran Rashid] Revert "Index active task sets by stage Id rather than by task set id"
      f025154 [Imran Rashid] Merge pull request #2 from kayousterhout/imran_SPARK-8103
      baf46e1 [Kay Ousterhout] Index active task sets by stage Id rather than by task set id
      19685bb [Imran Rashid] switch to using latestInfo.attemptId, and add comments
      a5f7c8c [Imran Rashid] remove comment for reviewers
      227b40d [Imran Rashid] style
      517b6e5 [Imran Rashid] get rid of SparkIllegalStateException
      b2faef5 [Imran Rashid] faster check for conflicting task sets
      6542b42 [Imran Rashid] remove extra stageAttemptId
      ada7726 [Imran Rashid] reviewer feedback
      d8eb202 [Imran Rashid] Merge branch 'master' into SPARK-8103
      46bc26a [Imran Rashid] more cleanup of debug garbage
      cb245da [Imran Rashid] finally found the issue ... clean up debug stuff
      8c29707 [Imran Rashid] Merge branch 'master' into SPARK-8103
      89a59b6 [Imran Rashid] more printlns ...
      9601b47 [Imran Rashid] more debug printlns
      ecb4e7d [Imran Rashid] debugging printlns
      b6bc248 [Imran Rashid] style
      55f4a94 [Imran Rashid] get rid of more random test case since kays tests are clearer
      7021d28 [Imran Rashid] update test since listenerBus.waitUntilEmpty now throws an exception instead of returning a boolean
      883fe49 [Kay Ousterhout] Unit tests for concurrent stages issue
      6e14683 [Imran Rashid] unit test just to make sure we fail fast on concurrent attempts
      06a0af6 [Imran Rashid] ignore for jenkins
      c443def [Imran Rashid] better fix and simpler test case
      28d70aa [Imran Rashid] wip on getting a better test case ...
      a9bf31f [Imran Rashid] wip
      80e2568b
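An illustrative, heavily simplified sketch (hypothetical names, not Spark's Scala code) of the invariant this patch enforces: at most one non-zombie TaskSetManager per stage. Submitting a new attempt is only legal once earlier attempts for that stage have been marked zombie.

```python
# Simplified model of the "no concurrent attempts per stage" check.
class TaskSetManager:
    def __init__(self, stage_id, attempt_id):
        self.stage_id = stage_id
        self.attempt_id = attempt_id
        self.is_zombie = False  # zombie attempts may finish tasks but launch none

class Scheduler:
    def __init__(self):
        self.task_sets_by_stage = {}  # stage_id -> [TaskSetManager]

    def submit_tasks(self, stage_id, attempt_id):
        managers = self.task_sets_by_stage.setdefault(stage_id, [])
        conflicting = [m for m in managers
                       if not m.is_zombie and m.attempt_id != attempt_id]
        if conflicting:
            raise RuntimeError(
                "more than one active task set for stage %d" % stage_id)
        manager = TaskSetManager(stage_id, attempt_id)
        managers.append(manager)
        return manager

sched = Scheduler()
first = sched.submit_tasks(stage_id=0, attempt_id=0)
first.is_zombie = True  # e.g. the stage was resubmitted after a fetch failure
second = sched.submit_tasks(stage_id=0, attempt_id=1)  # allowed: old one is a zombie
```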
    • Reynold Xin's avatar
      [SQL] Remove space from DataFrame Scala/Java API. · c6fe9b4a
      Reynold Xin authored
      I don't think this function is useful at all in Scala/Java, since users can easily compute n * space.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7530 from rxin/remove-space and squashes the following commits:
      
      c147873 [Reynold Xin] [SQL] Remove space from DataFrame Scala/Java API.
      c6fe9b4a
    • Wenchen Fan's avatar
      [SPARK-9186][SQL] make deterministic describing the tree rather than the expression · 04db58ae
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7525 from cloud-fan/deterministic and squashes the following commits:
      
      4189bfa [Wenchen Fan] make deterministic describing the tree rather than the expression
      04db58ae
    • Tarek Auel's avatar
      [SPARK-9177][SQL] Reuse of calendar object in WeekOfYear · a15ecd05
      Tarek Auel authored
      https://issues.apache.org/jira/browse/SPARK-9177
      
      rxin Are we sure that this is thread safe? chenghao-intel explained in another PR that every partition (if I remember correctly) uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7516 from tarekauel/SPARK-9177 and squashes the following commits:
      
      0c1313a [Tarek Auel] [SPARK-9177] utilize more powerful addMutableState
      6e2f03f [Tarek Auel] Merge branch 'master' into SPARK-9177
      a69ec92 [Tarek Auel] [SPARK-9177] address comment
      6cfb180 [Tarek Auel] [SPARK-9177] calendar as lazy transient val
      ff97b09 [Tarek Auel] [SPARK-9177] Reuse calendar object in interpreted code and codegen
      a15ecd05
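A rough Python analogy (all names hypothetical) for the "calendar as lazy transient val" idea: build the expensive Calendar-like helper once per expression instance and reuse it for every row, instead of allocating a fresh one per evaluation. The thread-safety question above matters precisely because the cached object is shared across all rows evaluated by that instance. Python's `isocalendar()` is only an approximation of `java.util.Calendar`'s week numbering.

```python
import datetime

class Calendar:
    """Stand-in for java.util.Calendar; assume construction is expensive."""
    def week_of_year(self, year, month, day):
        return datetime.date(year, month, day).isocalendar()[1]

class WeekOfYear:
    def __init__(self):
        self._calendar = None  # lazily built, then reused across rows

    @property
    def calendar(self):
        if self._calendar is None:
            self._calendar = Calendar()
        return self._calendar

    def eval(self, year, month, day):
        # per-row evaluation reuses the cached instance rather than
        # allocating a new Calendar for each row
        return self.calendar.week_of_year(year, month, day)

expr = WeekOfYear()
print(expr.eval(2015, 7, 20))
```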
    • Tarek Auel's avatar
      [SPARK-9153][SQL] codegen StringLPad/StringRPad · 5112b7f5
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9153
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7527 from tarekauel/SPARK-9153 and squashes the following commits:
      
      3840c6b [Tarek Auel] [SPARK-9153] removed codegen fallback
      92b6a5d [Tarek Auel] [SPARK-9153] codegen lpad/rpad
      5112b7f5
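For reference, a small sketch of the lpad/rpad semantics being code-generated (assumed to mirror Hive/Spark SQL: pad up to `length`, truncate when the input is already longer):

```python
# Assumed lpad/rpad semantics; not Spark's generated code.
def lpad(s, length, pad):
    if length <= len(s):
        return s[:length]          # truncate when already long enough
    if not pad:
        return s                   # nothing to pad with
    fill = (pad * length)[:length - len(s)]
    return fill + s

def rpad(s, length, pad):
    if length <= len(s):
        return s[:length]
    if not pad:
        return s
    fill = (pad * length)[:length - len(s)]
    return s + fill

print(lpad("hi", 5, "??"))  # "???hi"
print(rpad("hi", 5, "??"))  # "hi???"
```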
    • MechCoder's avatar
      [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test · d0b4e93f
      MechCoder authored
      Python API for the KS-test
      
      Statistics.kolmogorovSmirnovTest(data, distName, *params)
      I'm not quite sure how to support the callable function since it is not serializable.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7430 from MechCoder/spark-8996 and squashes the following commits:
      
      2dd009d [MechCoder] minor
      021d233 [MechCoder] Remove one wrapper and other minor stuff
      49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
      d0b4e93f
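A self-contained sketch (not Spark's implementation) of what the one-sample Kolmogorov-Smirnov test computes: the maximum distance between the empirical CDF of the data and a theoretical CDF, here the standard normal.

```python
import math

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(data, cdf=norm_cdf):
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = cdf(x)
        # the ECDF jumps from i/n to (i+1)/n at x; take the larger deviation
        d = max(d, abs(c - i / n), abs((i + 1) / n - c))
    return d

sample = [0.1, -0.2, 0.35, -1.1, 0.8]
print(ks_statistic(sample))
```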
    • George Dittmar's avatar
      [SPARK-7422] [MLLIB] Add argmax to Vector, SparseVector · 3f7de7db
      George Dittmar authored
      Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work is to set the stage for changes to be done in Spark-7423.
      
      Author: George Dittmar <georgedittmar@gmail.com>
      Author: George <dittmar@Georges-MacBook-Pro.local>
      Author: dittmarg <george.dittmar@webtrends.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:
      
      3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
      127dec5 [Xiangrui Meng] update argmax impl
      2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
      98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      5fd9380 [George Dittmar] fixing style check error
      42341fb [George Dittmar] refactoring arg max check to better handle zero values
      b22af46 [George Dittmar] Fixing spaces between commas in unit test
      f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
      aa330e3 [George Dittmar] Fixing some last if else spacing issues
      ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
      d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
      ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
      3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
      b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
      f21dcce [George Dittmar] commit
      af17981 [dittmarg] Initial work fixing bug that was made clear in pr
      eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
      4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      df9538a [George] Added argmax to sparse vector and added unit test
      3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
      04677af [George] initial work on adding argmax to Vector and SparseVector
      3f7de7db
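A sketch (hypothetical, simplified) of the sparse-vector corner case this work handles: a SparseVector's unstored entries are implicit zeros, so when every stored (active) value is negative, argmax must be the index of the first implicit zero rather than the largest active value.

```python
# Simplified sparse argmax; not the actual MLlib implementation.
def sparse_argmax(size, indices, values):
    if size == 0:
        raise ValueError("empty vector")
    if not indices:                  # every entry is an implicit zero
        return 0
    best_i, best_v = indices[0], values[0]
    for i, v in zip(indices, values):
        if v > best_v:
            best_i, best_v = i, v
    if best_v < 0 and len(indices) < size:
        # an implicit zero beats every active value: first missing index wins
        active = set(indices)
        return next(i for i in range(size) if i not in active)
    return best_i

print(sparse_argmax(4, [1, 3], [5.0, 2.0]))    # largest active value wins
print(sparse_argmax(4, [1, 3], [-5.0, -2.0]))  # an implicit zero wins
```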
    • Josh Rosen's avatar
      [SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange · 79ec0729
      Josh Rosen authored
      This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows.  It also makes several general efficiency improvements to Exchange.
      
      Key changes:
      
      - When performing hash partitioning, the old Exchange projected the partitioning columns into a new row and then passed a `(partitioningColumnRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it ends up redundantly serializing the partitioning columns only to immediately discard them after the shuffle.  After this patch's changes, Exchange shuffles `(partitionId: Int, row: InternalRow)` pairs.  This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation; in the future, we may be able to further optimize this once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
      - Exchange's `compute()` method has been significantly simplified; the new code has less duplication and thus is easier to understand.
      - When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized `UnsafeRowSerializer` to serialize these rows.  This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes.  Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7456 from JoshRosen/unsafe-exchange and squashes the following commits:
      
      7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite
      0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter
      a27cfc1 [Josh Rosen] Add missing newline
      741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange.
      359c6a4 [Josh Rosen] Remove println() and add comments
      93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange
      8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them
      dd9c66d [Josh Rosen] Fix for copying logic
      035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer
      7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle
      cbea80b [Josh Rosen] Add UnsafeRowSerializer
      0f2ac86 [Josh Rosen] Import ordering
      3ca8515 [Josh Rosen] Big code simplification in Exchange
      3526868 [Josh Rosen] Iniitial cut at removing shuffle on KV pairs
      79ec0729
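A conceptual sketch (plain Python, not Spark code) of the first change above: compute the partition id up front and ship only `(partitionId, row)` pairs, so the key columns are never serialized separately from the row.

```python
# Illustrative hash partitioning: only (pid, row) needs to cross the wire.
def hash_partition(rows, key_indices, num_partitions):
    buckets = {p: [] for p in range(num_partitions)}
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        pid = hash(key) % num_partitions  # replaces the projected key row
        buckets[pid].append(row)
    return buckets

rows = [("a", 1), ("b", 2), ("a", 3)]
buckets = hash_partition(rows, key_indices=[0], num_partitions=2)
```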
    • Jacky Li's avatar
      [SQL][DOC] Minor document fix in HadoopFsRelationProvider · 972d8900
      Jacky Li authored
      Caught this while reading the code
      
      Author: Jacky Li <lee.unreal@gmail.com>
      Author: Jacky Li <jackylk@users.noreply.github.com>
      
      Closes #7524 from jackylk/patch-11 and squashes the following commits:
      
      b679011 [Jacky Li] fix doc
      e10e211 [Jacky Li] [SQL] Minor document fix in HadoopFsRelationProvider
      972d8900
    • Reynold Xin's avatar
      5bdf16da
    • Wenchen Fan's avatar
      [SPARK-9185][SQL] improve code gen for mutable states to support complex initialization · 930253e0
      Wenchen Fan authored
      Sometimes we need more than one step to initialize mutable states in code gen, as in https://github.com/apache/spark/pull/7516
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7521 from cloud-fan/init and squashes the following commits:
      
      2106445 [Wenchen Fan] improve code gen for mutable states
      930253e0
  2. Jul 19, 2015
    • Liang-Chi Hsieh's avatar
      [SPARK-9172][SQL] Make DecimalPrecision support for Intersect and Except · d743bec6
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9172
      
      Simply make `DecimalPrecision` support `Intersect` and `Except` in addition to `Union`.
      
      Besides, add unit test for `DecimalPrecision` as well.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7511 from viirya/more_decimalprecieion and squashes the following commits:
      
      4d29d10 [Liang-Chi Hsieh] Fix code comment.
      9fb0d49 [Liang-Chi Hsieh] Make DecimalPrecision support for Intersect and Except.
      d743bec6
    • Tathagata Das's avatar
      [SPARK-9030] [STREAMING] [HOTFIX] Make sure that no attempts to create Kinesis... · 93eb2acf
      Tathagata Das authored
      [SPARK-9030] [STREAMING] [HOTFIX] Make sure that no attempts to create Kinesis streams are made when not enabled
      
      Problem: Even when the environment variable to enable the tests is disabled, the `beforeAll` of KinesisStreamSuite attempted to find AWS credentials to create a Kinesis stream, and failed.
      
      Solution: Made sure that all accesses to KinesisTestUtils that create streams are under `testOrIgnore`
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7519 from tdas/kinesis-tests and squashes the following commits:
      
      64d6d06 [Tathagata Das] Removed empty lines.
      7c18473 [Tathagata Das] Putting all access to KinesisTestUtils inside testOrIgnore
      93eb2acf
    • Reynold Xin's avatar
      [SPARK-8241][SQL] string function: concat_ws. · 163e3f1d
      Reynold Xin authored
      I also changed the semantics of concat w.r.t. null back to the same behavior as Hive:
      That is to say, concat now returns null if any input is null.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7504 from rxin/concat_ws and squashes the following commits:
      
      83fd950 [Reynold Xin] Fixed type casting.
      3ae85f7 [Reynold Xin] Write null better.
      cdc7be6 [Reynold Xin] Added code generation for pure string mode.
      a61c4e4 [Reynold Xin] Updated comments.
      2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.
      163e3f1d
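A sketch of the two null semantics described above (assumed to mirror Hive): `concat` returns null if any input is null, while `concat_ws` simply skips null arguments.

```python
# Assumed Hive-compatible null handling; None stands in for SQL NULL.
def concat(*args):
    if any(a is None for a in args):
        return None  # any null input nulls the whole result
    return "".join(args)

def concat_ws(sep, *args):
    # null arguments are skipped rather than nulling the result
    return sep.join(a for a in args if a is not None)

print(concat("a", None, "c"))
print(concat_ws("-", "a", None, "c"))
```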
    • Herman van Hovell's avatar
      [SPARK-8638] [SQL] Window Function Performance Improvements - Cleanup · 7a812453
      Herman van Hovell authored
      This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved.
      
      Git commit message is wrong BTW :(...
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits:
      
      4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
      7a812453
    • Nicholas Hwang's avatar
      [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions())... · a803ac3e
      Nicholas Hwang authored
      [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
      
      I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.
      
      Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, possibly unexpected behavior can occur.
      
      This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.
      
      As an illustrative example, submit the following to `spark-submit`:
      ```
      from pyspark import SparkConf, SparkContext
      import collections
      
      def updateCounter(acc, val):
          print('update acc:', acc)
          print('update val:', val)
          acc[val] += 1
          return acc
      
      def comboCounter(acc1, acc2):
          print('combo acc1:', acc1)
          print('combo acc2:', acc2)
          acc1.update(acc2)
          return acc1
      
      def main():
          conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
          sc = SparkContext(conf=conf)
      
          print('======= AGGREGATING with ONE PARTITION =======')
          print(sc.parallelize(range(1, 10), 1).aggregate(collections.Counter(), updateCounter, comboCounter))
      
          print('======= AGGREGATING with TWO PARTITIONS =======')
          print(sc.parallelize(range(1, 10), 2).aggregate(collections.Counter(), updateCounter, comboCounter))
      
      if __name__ == "__main__":
          main()
      ```
      
      One probably expects this to output the following:
      ```
      Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
      ```
      
      But it instead outputs this (regardless of the number of partitions):
      ```
      Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
      ```
      
      This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.
      
      I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to creating unique copies of `zeroValue` (i.e., `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.
      
      Author: Nicholas Hwang <moogling@gmail.com>
      
      Closes #7378 from njhwang/master and squashes the following commits:
      
      659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
      8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
      56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
      391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
      2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
      ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
      7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3.
      90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
      a803ac3e
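A runnable distillation (plain Python, no Spark) of the bug described above: sharing one mutable `zeroValue` object between the per-partition seqOp pass and the final fold inflates the counts, while copying `zeroValue` per use, as the fix does via `deepcopy`, gives the expected result. The exact inflated numbers here differ from the Spark example, but the aliasing mechanism is the same.

```python
import copy
from functools import reduce
from collections import Counter

def seq_op(acc, val):
    acc[val] += 1
    return acc

def comb_op(a, b):
    a.update(b)
    return a

def buggy_aggregate(partitions, zero):
    # one shared zero object for the map phase AND the fold phase
    parts = [reduce(seq_op, p, zero) for p in partitions]
    return reduce(comb_op, parts, zero)

def fixed_aggregate(partitions, zero):
    # a fresh copy of zero everywhere it is used
    parts = [reduce(seq_op, p, copy.deepcopy(zero)) for p in partitions]
    return reduce(comb_op, parts, copy.deepcopy(zero))

data = [[1, 2], [3]]
print(buggy_aggregate(data, Counter()))  # counts are inflated
print(fixed_aggregate(data, Counter()))
```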
    • Cheng Lian's avatar
      [HOTFIX] [SQL] Fixes compilation error introduced by PR #7506 · 34ed82bb
      Cheng Lian authored
      PR #7506 breaks master build because of compilation error. Note that #7506 itself looks good, but it seems that `git merge` did something stupid.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7510 from liancheng/hotfix-for-pr-7506 and squashes the following commits:
      
      7ea7e89 [Cheng Lian] Fixes compilation error
      34ed82bb
    • Cheng Lian's avatar
      [SPARK-9179] [BUILD] Allows committers to specify primary author of the PR to be merged · bc24289f
      Cheng Lian authored
      It's a common case that a contributor submits an initial version of a feature/bugfix, and later on other people (mostly committers) fork it and add more improvements. When merging these PRs, we probably want to specify the original author as the primary author. Currently we can only do this by running
      
      ```
      $ git commit --amend --author="name <email>"
      ```
      
      manually right before the merge script pushes to Apache Git repo. It would be nice if the script accepts user specified primary author information.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7508 from liancheng/spark-9179 and squashes the following commits:
      
      218d88e [Cheng Lian] Allows committers to specify primary author of the PR to be merged
      bc24289f
    • Reynold Xin's avatar
      [SQL] Make date/time functions more consistent with other database systems. · 3427937e
      Reynold Xin authored
      This pull request fixes some of the problems in #6981.
      
      - Added date functions to `__all__` so they get exposed
      - Rename day_of_month -> dayofmonth
      - Rename day_in_year -> dayofyear
      - Rename week_of_year -> weekofyear
      - Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@databricks.com>
      
      Closes #7506 from rxin/datetime and squashes the following commits:
      
      0cb24d9 [Reynold Xin] Export all functions in Python.
      e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
      9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
      3427937e
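A hedged illustration of the renamed functions' semantics using Python's `datetime` (Spark implements these as SQL expressions; only the names and results are meant to match):

```python
import datetime

def dayofmonth(d):
    return d.day

def dayofyear(d):
    return d.timetuple().tm_yday

def weekofyear(d):
    return d.isocalendar()[1]  # ISO week number, as in Hive's weekofyear

d = datetime.date(2015, 7, 19)
print(dayofmonth(d), dayofyear(d), weekofyear(d))
```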
    • Tarek Auel's avatar
      [SPARK-8199][SQL] follow up; revert change in test · a53d13f7
      Tarek Auel authored
      rxin / davies
      
      Sorry for that unnecessary change. And thanks again for all your support!
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7505 from tarekauel/SPARK-8199-FollowUp and squashes the following commits:
      
      d09321c [Tarek Auel] [SPARK-8199] follow up; revert change in test
      c17397f [Tarek Auel] [SPARK-8199] follow up; revert change in test
      67acfe6 [Tarek Auel] [SPARK-8199] follow up; revert change in test
      a53d13f7
    • Carl Anders Düvel's avatar
      [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 · 344d1567
      Carl Anders Düvel authored
      We are running Spark 1.4.0 in production and ran into problems: after a network hiccup (which happens often in our current environment), no more metrics were reported to Graphite, leaving us blind to the current state of our Spark applications. [This problem](https://github.com/dropwizard/metrics/commit/70559816f1fc3a0a0122b5263d5478ff07396991) was fixed in the current version of the metrics library. We now run Spark with this change in production and have seen no problems. We also reviewed the commit history since 3.1.0 and did not detect any potentially incompatible changes, but found many fixes that could help other users as well.
      
      Author: Carl Anders Düvel <c.a.duevel@gmail.com>
      
      Closes #7493 from hackbert/bump-metrics-lib-version and squashes the following commits:
      
      6677565 [Carl Anders Düvel] [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 in order to get this fix https://github.com/dropwizard/metrics/commit/70559816f1fc3a0a0122b5263d5478ff07396991
      344d1567
    • Liang-Chi Hsieh's avatar
      [SPARK-9166][SQL][PYSPARK] Capture and hide IllegalArgumentException in Python API · 9b644c41
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9166
      
      Simply capture and hide `IllegalArgumentException` in Python API.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7497 from viirya/hide_illegalargument and squashes the following commits:
      
      8324dce [Liang-Chi Hsieh] Fix python style.
      9ace67d [Liang-Chi Hsieh] Also check exception message.
      8b2ce5c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into hide_illegalargument
      7be016a [Liang-Chi Hsieh] Capture and hide IllegalArgumentException in Python.
      9b644c41
    • Reynold Xin's avatar
      89d13585
    • Herman van Hovell's avatar
      [SPARK-8638] [SQL] Window Function Performance Improvements · a9a0d0ce
      Herman van Hovell authored
      ## Description
      Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information.
      
      ## Improvements
      * Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases: an aggregate is maintained for every row, so space usage is N (N being the number of rows), and all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update it for each row, and get the aggregate value after each update. This is what the new implementation does. This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach, exploiting the fact that the aggregate operations are commutative; there is one twist, though: it processes the input buffer in reverse.
      * Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement.
      * A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR.
      * A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code.
      
      ## Benchmarking
      I have done a small benchmark using [on time performance](http://www.transtats.bts.gov) data for the month of April. I have used the origin as a partitioning key; as a result, there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per Frame type:
      
      Frame | Master | SPARK-8638
      ----- | ------ | ----------
      Entire Frame | 2 s | 1 s
      Sliding | 18 s | 1 s
      Growing | 14 s | 0.9 s
      Shrinking | 13 s | 1 s
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits:
      
      3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
      2eb3b33 [Herman van Hovell] Corrected reverse range frame processing.
      2cd2d5b [Herman van Hovell] Corrected reverse range frame processing.
      b0654d7 [Herman van Hovell] Tests for exotic frame specifications.
      e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code.
      1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite.
      ac2f682 [Herman van Hovell] Added a few more comments.
      1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods.
      bb020e6 [Herman van Hovell] Major overhaul of Window operator.
      a9a0d0ce
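The running-frame optimization described above can be sketched compactly: for UNBOUNDED PRECEDING AND CURRENT ROW frames, keep one accumulator and emit its value after each update (N updates total), instead of maintaining a separate aggregate per row (roughly N^2/2 updates in the old sliding approach). This is an illustrative sketch, not Spark's Window operator.

```python
def running_sum(partition):
    # rows are already sorted within the partition
    acc = 0
    out = []
    for value in partition:
        acc += value        # single buffer, one update per row
        out.append(acc)
    return out

def running_sum_reversed(partition):
    # CURRENT ROW AND UNBOUNDED FOLLOWING: same trick (sum is commutative),
    # but the input buffer is processed in reverse
    return running_sum(partition[::-1])[::-1]

print(running_sum([1, 2, 3, 4]))           # [1, 3, 6, 10]
print(running_sum_reversed([1, 2, 3, 4]))  # [10, 9, 7, 4]
```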
    • Reynold Xin's avatar
      Fixed test cases. · 04c1b49f
      Reynold Xin authored
      04c1b49f
    • Tarek Auel's avatar
      [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK... · 83b682be
      Tarek Auel authored
      [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions
      
      Jira:
      https://issues.apache.org/jira/browse/SPARK-8199
      https://issues.apache.org/jira/browse/SPARK-8184
      https://issues.apache.org/jira/browse/SPARK-8183
      https://issues.apache.org/jira/browse/SPARK-8182
      https://issues.apache.org/jira/browse/SPARK-8181
      https://issues.apache.org/jira/browse/SPARK-8180
      https://issues.apache.org/jira/browse/SPARK-8179
      https://issues.apache.org/jira/browse/SPARK-8177
      https://issues.apache.org/jira/browse/SPARK-8178
      https://issues.apache.org/jira/browse/SPARK-9115
      
      Regarding `day` and `dayofmonth`: are both necessary?
      
      ~~I am going to add `Quarter` to this PR as well.~~ Done.
      
      ~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Tarek Auel <tarek.auel@gmail.com>
      
      Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits:
      
      f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests
      bb567b6 [Tarek Auel] [SPARK-8199] fixed test
      3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix
      256c357 [Tarek Auel] [SPARK-8199] code cleanup
      5983dcc [Tarek Auel] [SPARK-8199] whitespace fix
      6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488
      4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling
      ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master
      70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199
      3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search
      fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix
      cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix
      746b80a [Tarek Auel] [SPARK-8199] build fix
      0ad6db8 [Tarek Auel] [SPARK-8199] minor fix
      523542d [Tarek Auel] [SPARK-8199] address comments
      2259299 [Tarek Auel] [SPARK-8199] day_of_month alias
      d01b977 [Tarek Auel] [SPARK-8199] python underscore
      56c4a92 [Tarek Auel] [SPARK-8199] update python docu
      e223bc0 [Tarek Auel] [SPARK-8199] refactoring
      d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility
      b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it
      1b2e540 [Tarek Auel] [SPARK-8119] style fix
      0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts
      ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring
      1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199
      740af0e [Tarek Auel] implement date function using a calculation based on days
      4fb66da [Tarek Auel] WIP: date functions on calculation only
      1a436c9 [Tarek Auel] wip
      f775f39 [Tarek Auel] fixed return type
      ad17e96 [Tarek Auel] improved implementation
      c42b444 [Tarek Auel] Removed merge conflict file
      ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues
      10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast
      7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue
      f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite
      6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval
      d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track
      7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199
      5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged
      eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199
      f120415 [Tarek Auel] improved runtime
      a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat
      5fe74e1 [Tarek Auel] fixed python style
      3bfac90 [Tarek Auel] fixed style
      356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation
      02efc5d [Tarek Auel] removed doubled code
      a5ea120 [Tarek Auel] added python api; changed test to be more meaningful
      b680db6 [Tarek Auel] added codegeneration to all functions
      c739788 [Tarek Auel] added support for quarter SPARK-8178
      849fb41 [Tarek Auel] fixed stupid test
      638596f [Tarek Auel] improved codegen
      4d8049b [Tarek Auel] fixed tests and added type check
      5ebb235 [Tarek Auel] resolved naming conflict
      d0e2f99 [Tarek Auel] date functions
      83b682be
  3. Jul 18, 2015
    • Forest Fang's avatar
      [SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits · 6cb6096c
      Forest Fang authored
      By grouping projection calls into multiple apply functions, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to test against 5k, as 60k took 15s for the unit test to complete.
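
The splitting idea above can be sketched in a few lines (illustrative Python that emits Java-like snippets; `split_into_methods` and the emitted signatures are hypothetical, not Spark's actual Scala codegen). A single generated `apply()` with tens of thousands of statements can exceed the JVM's 64KB-per-method bytecode limit, so the statements are grouped into bounded helper methods plus a small dispatcher:

```python
def split_into_methods(stmts, max_per_method=100):
    """Group generated statements into helper methods of bounded size,
    and emit a dispatcher that calls each helper in order."""
    groups = [stmts[i:i + max_per_method] for i in range(0, len(stmts), max_per_method)]
    helpers = [
        "private void apply_%d(InternalRow in, MutableRow out) { %s }" % (i, " ".join(g))
        for i, g in enumerate(groups)
    ]
    dispatcher = "public void apply(InternalRow in, MutableRow out) { %s }" % " ".join(
        "apply_%d(in, out);" % i for i in range(len(groups))
    )
    return helpers, dispatcher
```

With 250 projection statements and a cap of 100 per method, this produces three helpers and one dispatcher, keeping every generated method well under the bytecode limit.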
      
      Author: Forest Fang <forest.fang@outlook.com>
      
      Closes #7076 from saurfang/codegen_size_limit and squashes the following commits:
      
      b7a7635 [Forest Fang] [SPARK-8443][SQL] Execute and verify split projections in test
      adef95a [Forest Fang] [SPARK-8443][SQL] Use safer factor and rewrite splitting code
      1b5aa7e [Forest Fang] [SPARK-8443][SQL] inline execution if one block only
      9405680 [Forest Fang] [SPARK-8443][SQL] split projection code by size limit
      6cb6096c
    • Reynold Xin's avatar
      [SPARK-8278] Remove non-streaming JSON reader. · 45d798c3
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7501 from rxin/jsonrdd and squashes the following commits:
      
      767ec55 [Reynold Xin] More Mima
      51f456e [Reynold Xin] Mima exclude.
      789cb80 [Reynold Xin] Fixed compilation error.
      b4cf50d [Reynold Xin] [SPARK-8278] Remove non-streaming JSON reader.
      45d798c3
    • Reynold Xin's avatar
      [SPARK-9150][SQL] Create CodegenFallback and Unevaluable trait · 9914b1b2
      Reynold Xin authored
      It is very hard to track which expressions have codegen implemented and which do not. This patch removes the default fallback codegen implementation from Expression and moves it into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation or mix in CodegenFallback. This makes it very easy to track which expressions already have code generation implemented.
      
      Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star).
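
The trait layout described above can be illustrated with a Python mixin analogue (Spark's real traits are Scala and the class bodies here are simplified sketches, not the actual API): the base class offers no codegen fallback, `CodegenFallback` opts back into interpreted evaluation, and `Unevaluable` refuses both paths.

```python
class Expression:
    """Base expression: no default codegen fallback anymore."""
    def eval(self, row):
        raise NotImplementedError
    def gen_code(self):
        raise NotImplementedError

class CodegenFallback(Expression):
    """Opt-in mixin: the 'generated' code just calls back into interpreted eval."""
    def gen_code(self):
        return "/* fallback */ %s.eval(row)" % type(self).__name__

class Unevaluable(Expression):
    """Mixin for expressions (e.g. Star) that must be eliminated before evaluation."""
    def eval(self, row):
        raise TypeError("%s cannot be evaluated" % type(self).__name__)
    def gen_code(self):
        raise TypeError("%s cannot be compiled" % type(self).__name__)

class Star(Unevaluable):
    pass

class Conv(CodegenFallback):
    """Toy expression that mixes in the fallback instead of implementing codegen."""
    def eval(self, row):
        return row["x"]
```

Grepping for the mixin then gives an exact inventory of expressions still lacking real code generation, which is the tracking benefit the patch describes.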
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7487 from rxin/codegenfallback and squashes the following commits:
      
      14ebf38 [Reynold Xin] Fixed Conv
      6c1c882 [Reynold Xin] Fixed Alias.
      b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions.
      cb5c066 [Reynold Xin] Removed extra import.
      39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat
      9914b1b2
    • Reynold Xin's avatar
      [SPARK-9174][SQL] Add documentation for all public SQLConfs. · e16a19a3
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7500 from rxin/sqlconf and squashes the following commits:
      
      a5726c8 [Reynold Xin] [SPARK-9174][SQL] Add documentation for all public SQLConfs.
      e16a19a3
    • Reynold Xin's avatar
      [SPARK-8240][SQL] string function: concat · 6e1e2eba
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7486 from rxin/concat and squashes the following commits:
      
      5217d6e [Reynold Xin] Removed Hive's concat test.
      f5cb7a3 [Reynold Xin] Concat is never nullable.
      ae4e61f [Reynold Xin] Removed extra import.
      fddcbbd [Reynold Xin] Fixed NPE.
      22e831c [Reynold Xin] Added missing file.
      57a2352 [Reynold Xin] [SPARK-8240][SQL] string function: concat
      6e1e2eba
    • Yijie Shen's avatar
      [SPARK-9055][SQL] WidenTypes should also support Intersect and Except · 3d2134fc
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9055
      
      cc rxin
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7491 from yijieshen/widen and squashes the following commits:
      
      079fa52 [Yijie Shen] widenType support for intersect and expect
      3d2134fc
    • Reynold Xin's avatar
      Closes #6122 · cdc36eef
      Reynold Xin authored
      cdc36eef
    • Liang-Chi Hsieh's avatar
      [SPARK-9151][SQL] Implement code generation for Abs · 225de8da
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9151
      
      Add codegen support for `Abs`.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7498 from viirya/abs_codegen and squashes the following commits:
      
      0c8410f [Liang-Chi Hsieh] Implement code generation for Abs.
      225de8da
    • Wenchen Fan's avatar
      [SPARK-9171][SQL] add and improve tests for nondeterministic expressions · 86c50bf7
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7496 from cloud-fan/tests and squashes the following commits:
      
      0958f90 [Wenchen Fan] improve test for nondeterministic expressions
      86c50bf7
    • Wenchen Fan's avatar
      [SPARK-9167][SQL] use UTC Calendar in `stringToDate` · 692378c0
      Wenchen Fan authored
      fix 2 bugs introduced in https://github.com/apache/spark/pull/7353
      
      1. We should use a UTC Calendar when casting a string to a date. Before #7353, we used `DateTimeUtils.fromJavaDate(Date.valueOf(s.toString))` to cast a string to a date, and `fromJavaDate` called `millisToDays` to avoid the time zone issue. Now that we use `DateTimeUtils.stringToDate(s)`, we should create the Calendar with UTC from the beginning.
      2. We should not change the default time zone in test cases. The `threadLocalLocalTimeZone` and `threadLocalTimestampFormat` in `DateTimeUtils` are only evaluated once per thread, so we can't simply set the default time zone back afterwards.
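
The pitfall in point 1 can be sketched in Python (illustrative only; Spark's actual code is the Scala `DateTimeUtils`): converting a date string to "days since epoch" must pin the calendar to UTC, because going through the local time zone can shift the result by a whole day for zones behind UTC.

```python
from datetime import datetime, timezone

def string_to_days(s):
    """Parse 'YYYY-MM-DD' and return days since 1970-01-01, computed in UTC."""
    # Attaching tzinfo=UTC before taking the timestamp is the analogue of
    # creating the Calendar with UTC from the beginning.
    dt = datetime.strptime(s, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) // 86400
```

Had the naive datetime been interpreted in a local zone west of UTC, midnight of the parsed date would fall on the previous UTC day and the day count would come out one too low.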
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7488 from cloud-fan/datetime and squashes the following commits:
      
      9cd6005 [Wenchen Fan] address comments
      21ef293 [Wenchen Fan] fix 2 bugs in datetime
      692378c0
    • Wenchen Fan's avatar
      [SPARK-9142][SQL] remove more self type in catalyst · 1b4ff055
      Wenchen Fan authored
      a follow up of https://github.com/apache/spark/pull/7479.
      `TreeNode` is the root cause of the `self: Product =>` requirement, so why not make `TreeNode` extend `Product`?
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7495 from cloud-fan/self-type and squashes the following commits:
      
      8676af7 [Wenchen Fan] remove more self type
      1b4ff055
    • Josh Rosen's avatar
      [SPARK-9143] [SQL] Add planner rule for automatically inserting Unsafe <->... · b8aec6cd
      Josh Rosen authored
      [SPARK-9143] [SQL] Add planner rule for automatically inserting Unsafe <-> Safe row format converters
      
      Now that we have two different internal row formats, UnsafeRow and the old Java-object-based row format, we end up having to perform conversions between these two formats. These conversions should not be performed by the operators themselves; instead, the planner should be responsible for inserting appropriate format conversions when they are needed.
      
      This patch makes the following changes:
      
      - Add two new physical operators for performing row format conversions, `ConvertToUnsafe` and `ConvertFromUnsafe`.
      - Add new methods to `SparkPlan` to allow operators to express whether they output UnsafeRows and whether they can handle safe or unsafe rows as inputs.
      - Implement an `EnsureRowFormats` rule to automatically insert converter operators where necessary.
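
The shape of such a rule can be sketched as a bottom-up tree walk (a toy Python model; the field names, defaults, and `ensure_row_formats` are assumptions for illustration, whereas Spark's real rule operates on `SparkPlan` nodes):

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    name: str
    children: list = field(default_factory=list)
    outputs_unsafe: bool = False   # does this operator emit UnsafeRows?
    accepts_unsafe: bool = True    # can it consume UnsafeRows?
    accepts_safe: bool = True      # can it consume safe (Java-object) rows?

def ensure_row_formats(plan):
    """Bottom-up: wrap any child whose output format the parent cannot consume."""
    fixed = []
    for child in plan.children:
        child = ensure_row_formats(child)
        if child.outputs_unsafe and not plan.accepts_unsafe:
            child = Plan("ConvertToSafe", [child], outputs_unsafe=False)
        elif not child.outputs_unsafe and not plan.accepts_safe:
            child = Plan("ConvertToUnsafe", [child], outputs_unsafe=True)
        fixed.append(child)
    plan.children = fixed
    return plan
```

For example, a scan that emits UnsafeRows feeding a filter that only handles safe rows gets a `ConvertToSafe` inserted between them, so operators never see a row format they cannot process.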
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7482 from JoshRosen/unsafe-converter-planning and squashes the following commits:
      
      7450fa5 [Josh Rosen] Resolve conflicts in favor of choosing UnsafeRow
      5220cce [Josh Rosen] Add roundtrip converter test
      2bb8da8 [Josh Rosen] Add Union unsafe support + tests to bump up test coverage
      6f79449 [Josh Rosen] Add even more assertions to execute()
      08ce199 [Josh Rosen] Rename ConvertFromUnsafe -> ConvertToSafe
      0e2d548 [Josh Rosen] Add assertion if operators' input rows are in different formats
      cabb703 [Josh Rosen] Add tests for Filter
      3b11ce3 [Josh Rosen] Add missing test file.
      ae2195a [Josh Rosen] Fixes
      0fef0f8 [Josh Rosen] Rename file.
      d5f9005 [Josh Rosen] Finish writing EnsureRowFormats planner rule
      b5df19b [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-converter-planning
      9ba3038 [Josh Rosen] WIP
      b8aec6cd
    • Reynold Xin's avatar
      [SPARK-9169][SQL] Improve unit test coverage for null expressions. · fba3f5ba
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7490 from rxin/unit-test-null-funcs and squashes the following commits:
      
      7b276f0 [Reynold Xin] Move isNaN.
      8307287 [Reynold Xin] [SPARK-9169][SQL] Improve unit test coverage for null expressions.
      fba3f5ba