  1. Jun 19, 2015
    • Carson Wang's avatar
      [SPARK-8387] [FOLLOWUP] [WEBUI] Update driver log URL to show only 4096 bytes · 54557f35
      Carson Wang authored
      This is to follow up #6834 , update the driver log URL as well for consistency.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #6878 from carsonwang/logUrl and squashes the following commits:
      
      13be948 [Carson Wang] update log URL in YarnClusterSuite
      a0004f4 [Carson Wang] Update driver log URL to show only 4096 bytes
      54557f35
    • Kevin Conor's avatar
      [SPARK-8339] [PYSPARK] integer division for python 3 · fdf63f12
      Kevin Conor authored
      `itertools.islice` requires an integer for the stop argument. Switching to integer division here prevents a ValueError when `vs` is evaluated above.
      
      davies
      
      This is my original work, and I license it to the project.
      
      Author: Kevin Conor <kevin@discoverybayconsulting.com>
      
      Closes #6794 from kconor/kconor-patch-1 and squashes the following commits:
      
      da5e700 [Kevin Conor] Integer division for batch size
      fdf63f12
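      The Python 3 pitfall behind this fix can be reproduced in isolation (a minimal sketch; `items` and `batch` are stand-ins for the variables in the patched code):

```python
from itertools import islice

items = list(range(10))
batch = 4

# Python 3: "/" is true division, so batch / 2 is a float (2.0),
# and islice raises ValueError for a non-integer stop argument.
try:
    list(islice(items, batch / 2))
    raised = False
except ValueError:
    raised = True

# Integer division keeps the stop argument an int, which is the fix.
half = list(islice(items, batch // 2))
print(raised, half)  # True [0, 1]
```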
    • Bryan Cutler's avatar
      [SPARK-8444] [STREAMING] Adding Python streaming example for queueStream · a2016b4b
      Bryan Cutler authored
      A Python example similar to the existing one for Scala.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6884 from BryanCutler/streaming-queueStream-example-8444 and squashes the following commits:
      
      435ba7e [Bryan Cutler] [SPARK-8444] Fixed style checks, increased sleep time to show empty queue
      257abb0 [Bryan Cutler] [SPARK-8444] Stop context gracefully, Removed unused import, Added description comment
      376ef6e [Bryan Cutler] [SPARK-8444] Fixed bug causing DStream.pprint to append empty parenthesis to output instead of blank line
      1ff5f8b [Bryan Cutler] [SPARK-8444] Adding Python streaming example for queue_stream
      a2016b4b
    • Yu ISHIKAWA's avatar
      [SPARK-8348][SQL] Add in operator to DataFrame Column · 754929b1
      Yu ISHIKAWA authored
      I have added it only for Scala.
      
      TODO: we should also support `in` operator in Python.
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6824 from yu-iskw/SPARK-8348 and squashes the following commits:
      
      e76d02f [Yu ISHIKAWA] Not use infix notation
      6f744ac [Yu ISHIKAWA] Fit the test cases because these used the old test data set.
      00077d3 [Yu ISHIKAWA] [SPARK-8348][SQL] Add in operator to DataFrame Column
      754929b1
    • Cheng Lian's avatar
      [SPARK-8458] [SQL] Don't strip scheme part of output path when writing ORC files · a71cbbde
      Cheng Lian authored
      `Path.toUri.getPath` strips scheme part of output path (from `file:///foo` to `/foo`), which causes ORC data source only writes to the file system configured in Hadoop configuration. Should use `Path.toString` instead.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6892 from liancheng/spark-8458 and squashes the following commits:
      
      87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files
      a71cbbde
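      The same scheme-stripping pitfall is easy to demonstrate with Python's standard library (an analogy only — the actual fix is in the Scala ORC writer):

```python
from urllib.parse import urlparse

uri = "file:///foo"

# Taking only the path component drops the "file" scheme, so downstream
# code falls back to whatever default file system is configured.
path_only = urlparse(uri).path
print(path_only)  # /foo

# Keeping the full string preserves the scheme, analogous to Path.toString.
full = uri
print(full)  # file:///foo
```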
  2. Jun 18, 2015
    • Dibyendu Bhattacharya's avatar
      [SPARK-8080] [STREAMING] Receiver.store with Iterator does not give correct count at Spark UI · 3eaed876
      Dibyendu Bhattacharya authored
      tdas zsxwing, this is the new PR for SPARK-8080.
      
      I have merged https://github.com/apache/spark/pull/6659
      
      Also worth mentioning: for MEMORY_ONLY settings, when a block cannot be unrolled safely to memory because there is not enough space, the BlockManager won't try to put the block, and ReceivedBlockHandler will throw a SparkException because it cannot find the block id in PutResult. Thus the number of records in the block won't be counted if the block failed to unroll in memory, which is fine.
      
      For MEMORY_DISK settings, if the BlockManager is not able to unroll the block to memory, the block will still get deserialized to disk. The same holds for the WAL-based store. So for those cases (storage level = memory + disk) the number of records will be counted even though the block could not be unrolled to memory.
      
      Thus I added isFullyConsumed to the CountingIterator but have not used it, as it can never happen that a block is not fully consumed while ReceivedBlockHandler still gets the block id.
      
      I have added a few test cases to cover those block-unrolling scenarios as well.
      
      Author: Dibyendu Bhattacharya <dibyendu.bhattacharya1@pearson.com>
      Author: U-PEROOT\UBHATD1 <UBHATD1@PIN-L-PI046.PEROOT.com>
      
      Closes #6707 from dibbhatt/master and squashes the following commits:
      
      f6cb6b5 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
      f37cfd8 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
      5a8344a [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Count ByteBufferBlock as 1 count
      fceac72 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
      0153e7e [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Fixed comments given by @zsxwing
      4c5931d [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
      01e6dc8 [U-PEROOT\UBHATD1] A
      3eaed876
    • Lars Francke's avatar
      [SPARK-8462] [DOCS] Documentation fixes for Spark SQL · 4ce3bab8
      Lars Francke authored
      This fixes various minor documentation issues on the Spark SQL page
      
      Author: Lars Francke <lars.francke@gmail.com>
      
      Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:
      
      dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
      34eff2c [Lars Francke] Minor documentation fixes
      4ce3bab8
    • Sandy Ryza's avatar
      [SPARK-8135] Don't load defaults when reconstituting Hadoop Configurations · 43f50dec
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #6679 from sryza/sandy-spark-8135 and squashes the following commits:
      
      c5554ff [Sandy Ryza] SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
      43f50dec
    • Reynold Xin's avatar
      [SPARK-8218][SQL] Binary log math function update. · dc413138
      Reynold Xin authored
      Some minor updates based on after merging #6725.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6871 from rxin/log and squashes the following commits:
      
      ab51542 [Reynold Xin] Use JVM log
      76fc8de [Reynold Xin] Fixed arg.
      a7c1522 [Reynold Xin] [SPARK-8218][SQL] Binary log math function update.
      dc413138
    • Josh Rosen's avatar
      [SPARK-8446] [SQL] Add helper functions for testing SparkPlan physical operators · 207a98ca
      Josh Rosen authored
      This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators.  This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries.
      
      These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Josh Rosen <rosenville@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits:
      
      f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code
      84214be [Josh Rosen] Add an extra column which isn't part of the sort
      ae1896b [Josh Rosen] Provide implicits automatically
      a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885
      d9ab1e4 [Michael Armbrust] Add simple resolver
      c60a44d [Josh Rosen] Manually bind references
      996332a [Josh Rosen] Add types so that tests compile
      a46144a [Josh Rosen] WIP
      207a98ca
    • zsxwing's avatar
      [SPARK-8376] [DOCS] Add common lang3 to the Spark Flume Sink doc · 24e53793
      zsxwing authored
      Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:
      
      f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc
      24e53793
    • Josh Rosen's avatar
      [SPARK-8353] [DOCS] Show anchor links when hovering over documentation headers · 44c931f0
      Josh Rosen authored
      This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png)
      
      This makes it easier for users to link to specific sections of the documentation.
      
      I also removed some dead JavaScript which isn't used in our current docs (it was introduced for the old AMPCamp training but isn't needed anymore).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6808 from JoshRosen/SPARK-8353 and squashes the following commits:
      
      e59d8a7 [Josh Rosen] Suppress underline on hover
      f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places
      a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code
      44c931f0
    • Davies Liu's avatar
      [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark · 9b200272
      Davies Liu authored
      The batch size during external sort will grow up to a maximum of 10000, then shrink down to zero, causing an infinite loop.
      Given the assumption that items usually have similar sizes, we don't need to adjust the batch size after the first spill.
      
      cc JoshRosen rxin angelini
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6714 from davies/batch_size and squashes the following commits:
      
      b170dfb [Davies Liu] update test
      b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
      6ade745 [Davies Liu] update test
      5c21777 [Davies Liu] Update shuffle.py
      e746aec [Davies Liu] fix batch size during sort
      9b200272
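      The failure mode and the fix can be sketched as follows (a simplified model of the batch-size adjustment, not the actual `shuffle.py` code; function names are illustrative):

```python
MAX_BATCH = 10000

def buggy_next_batch(batch, over_limit):
    # Halving on every spill eventually drives the batch size to zero,
    # after which the read loop can never make progress again.
    return batch // 2 if over_limit else min(batch * 2, MAX_BATCH)

def fixed_next_batch(batch, spilled_once):
    # After the first spill, keep the batch size fixed: items are assumed
    # to have similar sizes, so no further adjustment is needed.
    return batch if spilled_once else min(batch * 2, MAX_BATCH)

buggy = 100
for _ in range(10):
    buggy = buggy_next_batch(buggy, over_limit=True)

fixed = 100
for _ in range(10):
    fixed = fixed_next_batch(fixed, spilled_once=True)

print(buggy, fixed)  # 0 100
```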
    • Liang-Chi Hsieh's avatar
      [SPARK-8363][SQL] Move sqrt to math and extend UnaryMathExpression · 31641128
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8363
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6823 from viirya/move_sqrt and squashes the following commits:
      
      8977e11 [Liang-Chi Hsieh] Remove unnecessary old tests.
      d23e79e [Liang-Chi Hsieh] Explicitly indicate sqrt value sequence.
      699f48b [Liang-Chi Hsieh] Use correct @since tag.
      8dff6d1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into move_sqrt
      bc2ed77 [Liang-Chi Hsieh] Remove/move arithmetic expression test and expression type checking test. Remove unnecessary Sqrt type rule.
      d38492f [Liang-Chi Hsieh] Now sqrt accepts boolean because type casting is handled by HiveTypeCoercion.
      297cc90 [Liang-Chi Hsieh] Sqrt only accepts double input.
      ef4a21a [Liang-Chi Hsieh] Move sqrt to math.
      31641128
    • Neelesh Srinivas Salian's avatar
      [SPARK-8320] [STREAMING] Add example in streaming programming guide that shows... · ddc5baf1
      Neelesh Srinivas Salian authored
      [SPARK-8320] [STREAMING] Add example in streaming programming guide that shows union of multiple input streams
      
      Added python code to https://spark.apache.org/docs/latest/streaming-programming-guide.html
      to the Level of Parallelism in Data Receiving section.
      
      Please review and let me know if there are any additional changes that are needed.
      
      Thank you.
      
      Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
      
      Closes #6862 from nssalian/SPARK-8320 and squashes the following commits:
      
      4bfd126 [Neelesh Srinivas Salian] Changed loop structure to be more in line with Python style
      e5345de [Neelesh Srinivas Salian] Changes to kafak append, for loop and show to print()
      3fc5c6d [Neelesh Srinivas Salian] SPARK-8320
      ddc5baf1
    • Yijie Shen's avatar
      [SPARK-8283][SQL] Resolve udf_struct test failure in HiveCompatibilitySuite · e86fbdb1
      Yijie Shen authored
      This PR aims to resolve the udf_struct test failure in HiveCompatibilitySuite.
      
      Currently, this is done by loosening CreateStruct's children type from NamedExpression to Expression and automatically generating StructField names for non-NamedExpression children.
      
      The naming convention for unnamed children follows the udf's counterpart in Hive:
      `col1, col2, col3, ...`
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #6828 from yijieshen/SPARK-8283 and squashes the following commits:
      
      6052b73 [Yijie Shen] Doc fix
      677e0b7 [Yijie Shen] Resolve udf_struct test failure by automatically generate structField name for non-NamedExpression children
      e86fbdb1
    • Liang-Chi Hsieh's avatar
      [SPARK-8218][SQL] Add binary log math function · fee3438a
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8218
      
      Because there is already `log` unary function defined, the binary log function is called `logarithm` for now.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6725 from viirya/expr_binary_log and squashes the following commits:
      
      bf96bd9 [Liang-Chi Hsieh] Compare log result in string.
      102070d [Liang-Chi Hsieh] Round log result to better comparing in python test.
      fd01863 [Liang-Chi Hsieh] For comments.
      beed631 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
      6089d11 [Liang-Chi Hsieh] Remove unnecessary override.
      8cf37b7 [Liang-Chi Hsieh] For comments.
      bc89597 [Liang-Chi Hsieh] For comments.
      db7dc38 [Liang-Chi Hsieh] Use ctor instead of companion object.
      0634ef7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
      1750034 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
      3d75bfc [Liang-Chi Hsieh] Fix scala style.
      5b39c02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
      23c54a3 [Liang-Chi Hsieh] Fix scala style.
      ebc9929 [Liang-Chi Hsieh] Let Logarithm accept one parameter too.
      605574d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
      21c3bfd [Liang-Chi Hsieh] Fix scala style.
      c6c187f [Liang-Chi Hsieh] For comments.
      c795342 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
      f373bac [Liang-Chi Hsieh] Add binary log expression.
      fee3438a
    • zsxwing's avatar
      [SPARK-7961][SQL]Refactor SQLConf to display better error message · 78a430ea
      zsxwing authored
      1. Add `SQLConfEntry` to store the information about a configuration. For those configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`.
      2. Verify the value when setting a configuration if this is in SQLConf.
      3. Use `SET -v` to display all public configurations.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6747 from zsxwing/sqlconf and squashes the following commits:
      
      7d09bad [zsxwing] Use SQLConfEntry in HiveContext
      49f6213 [zsxwing] Add getConf, setConf to SQLContext and HiveContext
      e014f53 [zsxwing] Merge branch 'master' into sqlconf
      93dad8e [zsxwing] Fix the unit tests
      cf950c1 [zsxwing] Fix the code style and tests
      3c5f03e [zsxwing] Add unsetConf(SQLConfEntry) and fix the code style
      a2f4add [zsxwing] getConf will return the default value if a config is not set
      037b1db [zsxwing] Add schema to SetCommand
      0520c3c [zsxwing] Merge branch 'master' into sqlconf
      7afb0ec [zsxwing] Fix the configurations about HiveThriftServer
      7e728e3 [zsxwing] Add doc for SQLConfEntry and fix 'toString'
      5e95b10 [zsxwing] Add enumConf
      c6ba76d [zsxwing] setRawString => setConfString, getRawString => getConfString
      4abd807 [zsxwing] Fix the test for 'set -v'
      6e47e56 [zsxwing] Fix the compilation error
      8973ced [zsxwing] Remove floatConf
      1fc3a8b [zsxwing] Remove the 'conf' command and use 'set -v' instead
      99c9c16 [zsxwing] Fix tests that use SQLConfEntry as a string
      88a03cc [zsxwing] Add new lines between confs and return types
      ce7c6c8 [zsxwing] Remove seqConf
      f3c1b33 [zsxwing] Refactor SQLConf to display better error message
      78a430ea
    • Lianhui Wang's avatar
      [SPARK-8381][SQL]reuse typeConvert when convert Seq[Row] to catalyst type · 9db73ec1
      Lianhui Wang authored
      Reuse typeConvert when converting Seq[Row] to Catalyst type.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #6831 from lianhuiwang/reuse-typeConvert and squashes the following commits:
      
      1fec395 [Lianhui Wang] remove CatalystTypeConverters.convertToCatalyst
      714462d [Lianhui Wang] add package[sql]
      9d1fbf3 [Lianhui Wang] address JoshRosen's comments
      768956f [Lianhui Wang] update scala style
      4498c62 [Lianhui Wang] reuse typeConvert
      9db73ec1
    • Burak Yavuz's avatar
      [SPARK-8095] Resolve dependencies of --packages in local ivy cache · 3b610770
      Burak Yavuz authored
      Dependencies of artifacts in the local ivy cache were not being resolved properly. The dependencies were not being picked up. Now they should be.
      
      cc andrewor14
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6788 from brkyvz/local-ivy-fix and squashes the following commits:
      
      2875bf4 [Burak Yavuz] fix temp dir bug
      48cc648 [Burak Yavuz] improve deletion
      a69e3e6 [Burak Yavuz] delete cache before test as well
      0037197 [Burak Yavuz] fix merge conflicts
      f60772c [Burak Yavuz] use different folder for m2 cache during testing
      b6ef038 [Burak Yavuz] [SPARK-8095] Resolve dependencies of Spark Packages in local ivy cache
      3b610770
    • xutingjun's avatar
      [SPARK-8392] RDDOperationGraph: getting cached nodes is slow · e2cdb056
      xutingjun authored
      ```scala
      def getAllNodes: Seq[RDDOperationNode] =
        _childNodes ++ _childClusters.flatMap(_.childNodes)
      ```
      
      When `_childClusters` contains many nodes, this process becomes extremely slow. I think we can improve the efficiency here.
      
      Author: xutingjun <xutingjun@huawei.com>
      
      Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:
      
      53b03ea [xutingjun] change code to more concise and easier to read
      f98728b [xutingjun] fix words: node -> nodes
      f87c663 [xutingjun] put the filter inside
      81f9fd2 [xutingjun] put the filter inside
      e2cdb056
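      The improvement amounts to computing the flattened child collection once instead of on every call — sketched here in Python (the real change is in Scala's RDDOperationGraph; class and field names below are illustrative stand-ins):

```python
class Cluster:
    """Hypothetical stand-in for an RDD operation cluster."""

    def __init__(self, child_nodes, child_clusters=()):
        self._child_nodes = list(child_nodes)
        self._child_clusters = list(child_clusters)
        self._cached_nodes = None  # memoized result of get_all_nodes

    def get_all_nodes(self):
        # Flatten the child collections only once; repeated calls (as the
        # UI makes while rendering the DAG) reuse the cached list.
        if self._cached_nodes is None:
            self._cached_nodes = list(self._child_nodes)
            for cluster in self._child_clusters:
                self._cached_nodes.extend(cluster._child_nodes)
        return self._cached_nodes

root = Cluster(["a"], [Cluster(["b", "c"])])
first = root.get_all_nodes()
second = root.get_all_nodes()
print(first, first is second)  # ['a', 'b', 'c'] True
```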
    • MechCoder's avatar
      [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct · 22732e1e
      MechCoder authored
      Python API for org.apache.spark.mllib.feature.ElementwiseProduct
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6346 from MechCoder/spark-7605 and squashes the following commits:
      
      79d1ef5 [MechCoder] Consistent and support list / array types
      5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
      22732e1e
    • zsxwing's avatar
      [SPARK-8373] [PYSPARK] Remove PythonRDD.emptyRDD · 4817ccdf
      zsxwing authored
      This is a follow-up PR to remove unused `PythonRDD.emptyRDD` added by #6826
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6867 from zsxwing/remove-PythonRDD-emptyRDD and squashes the following commits:
      
      b66d363 [zsxwing] Remove PythonRDD.emptyRDD
      4817ccdf
  3. Jun 17, 2015
    • Punya Biswal's avatar
      [SPARK-8397] [SQL] Allow custom configuration for TestHive · d1069cba
      Punya Biswal authored
      We encourage people to use TestHive in unit tests, because it's
      impossible to create more than one HiveContext within one process. The
      current implementation locks people into using a local[2] SparkContext
      underlying their HiveContext.  We should make it possible to override
      this using a system property so that people can test against
      local-cluster or remote spark clusters to make their tests more
      realistic.
      
      Author: Punya Biswal <pbiswal@palantir.com>
      
      Closes #6844 from punya/feature/SPARK-8397 and squashes the following commits:
      
      97ef394 [Punya Biswal] [SPARK-8397][SQL] Allow custom configuration for TestHive
      d1069cba
    • zsxwing's avatar
      [SPARK-8404] [STREAMING] [TESTS] Use thread-safe collections to make the tests more reliable · a06d9c8e
      zsxwing authored
      KafkaStreamSuite, DirectKafkaStreamSuite, JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite use non-thread-safe collections to collect data in one thread and check it in another thread. It may fail the tests.
      
      This PR changes them to thread-safe collections.
      
      Note: I cannot reproduce the test failures in my environment. But at least, this PR should make the tests more reliable.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6852 from zsxwing/fix-KafkaStreamSuite and squashes the following commits:
      
      d464211 [zsxwing] Use thread-safe collections to make the tests more reliable
      a06d9c8e
    • Yin Huai's avatar
      [SPARK-8306] [SQL] AddJar command needs to set the new class loader to the... · 302556ff
      Yin Huai authored
      [SPARK-8306] [SQL] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
      
      https://issues.apache.org/jira/browse/SPARK-8306
      
      I will try to add a test later.
      
      marmbrus aarondav
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6758 from yhuai/SPARK-8306 and squashes the following commits:
      
      1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
      302556ff
    • Wenchen Fan's avatar
      [SPARK-7067] [SQL] fix bug when use complex nested fields in ORDER BY · 7f05b1fe
      Wenchen Fan authored
      This PR is an improvement on https://github.com/apache/spark/pull/5189.
      
      The resolution rule for ORDER BY is: first resolve based on what comes from the select clause and then fall back on its child only when this fails.
      
      There are 2 steps. First, try to resolve `Sort` in `ResolveReferences` based on select clause, and ignore exceptions. Second, try to resolve `Sort` in `ResolveSortReferences` and add missing projection.
      
      However, the way we resolve `SortOrder` is wrong. We just resolve `UnresolvedAttribute` and use the result to indicate whether we can resolve `SortOrder`. But `UnresolvedAttribute` is only part of the `GetField` chain (broken by `GetItem`), so we need to go through the whole chain to determine whether we can resolve `SortOrder`.
      
      With this change, we can also avoid re-throwing the GetField exception in `CheckAnalysis`, which is a little ugly.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5659 from cloud-fan/order-by and squashes the following commits:
      
      cfa79f8 [Wenchen Fan] update test
      3245d28 [Wenchen Fan] minor improve
      465ee07 [Wenchen Fan] address comment
      1fc41a2 [Wenchen Fan] fix SPARK-7067
      7f05b1fe
    • zsxwing's avatar
      [SPARK-7913] [CORE] Increase the maximum capacity of PartitionedPairBuffe,... · a411a40d
      zsxwing authored
      [SPARK-7913] [CORE] Increase the maximum capacity of PartitionedPairBuffer, PartitionedSerializedPairBuffer and AppendOnlyMap
      
      The previous growing strategy was to always double the capacity.
      
      This PR adjusts the strategy: double the capacity, but on overflow use the maximum capacity as the new capacity. It increases the maximum capacity of PartitionedPairBuffer from `2 ^ 29` to `(2 ^ 30) - 1`, the maximum capacity of PartitionedSerializedPairBuffer from `2 ^ 28` to `(2 ^ 29) - 1`, and the maximum capacity of AppendOnlyMap from `0.7 * (2 ^ 29)` to `2 ^ 29`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6456 from zsxwing/SPARK-7913 and squashes the following commits:
      
      abcb932 [zsxwing] Address comments
      e30b61b [zsxwing] Increase the maximum capacity of AppendOnlyMap
      05b6420 [zsxwing] Update the exception message
      64fe227 [zsxwing] Increase the maximum capacity of PartitionedPairBuffer and PartitionedSerializedPairBuffer
      a411a40d
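      The adjusted growth strategy described above can be expressed as a small helper (an illustrative sketch, using the PartitionedPairBuffer limit from the PR; the real code is Scala):

```python
MAX_CAPACITY = 2 ** 30 - 1  # per the PR, for PartitionedPairBuffer

def next_capacity(capacity):
    # Double the capacity, but if doubling would exceed the maximum,
    # use the maximum capacity as the new capacity instead of failing.
    if capacity >= MAX_CAPACITY:
        raise ValueError("cannot grow past the maximum capacity")
    doubled = capacity * 2
    return doubled if doubled <= MAX_CAPACITY else MAX_CAPACITY

grown = next_capacity(2 ** 29)
print(grown == MAX_CAPACITY)  # True: clamped rather than overflowing
```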
    • zsxwing's avatar
      [SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD · 0fc4b96f
      zsxwing authored
      This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits:
      
      b36993f [zsxwing] Update the return type to JavaRDD[T]
      71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
      0fc4b96f
    • Carson Wang's avatar
      [SPARK-8372] History server shows incorrect information for application not started · 2837e067
      Carson Wang authored
      The history server may show an incorrect App ID for an incomplete application like `<App ID>.inprogress`. This app info never disappears, even after the app completes.
      ![incorrectappinfo](https://cloud.githubusercontent.com/assets/9278199/8156147/2a10fdbe-137d-11e5-9620-c5b61d93e3c1.png)
      
      The cause of the issue is that a log path name is used as the app id when the app id cannot be obtained during replay.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #6827 from carsonwang/SPARK-8372 and squashes the following commits:
      
      cdbb089 [Carson Wang] Fix code style
      3e46b35 [Carson Wang] Update code style
      90f5dde [Carson Wang] Add a unit test
      d8c9cd0 [Carson Wang] Replaying events only return information when app is started
      2837e067
    • Mingfei's avatar
      [SPARK-8161] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized · 7ad8c5d8
      Mingfei authored
      externalBlockStoreInitialized is never set to true, which means the blocks stored in ExternalBlockStore cannot be removed.
      
      Author: Mingfei <mingfei.shi@intel.com>
      
      Closes #6702 from shimingfei/SetTrue and squashes the following commits:
      
      add61d8 [Mingfei] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized
      7ad8c5d8
    • OopsOutOfMemory's avatar
      [SPARK-8010] [SQL] Promote types to StringType as implicit conversion in... · 98ee3512
      OopsOutOfMemory authored
      [SPARK-8010] [SQL] Promote types to StringType as implicit conversion in non-binary expression of HiveTypeCoercion
      
      1. Given the query `select coalesce(null, 1, '1') from dual`, an exception is thrown:
      `java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType`
      2. Given the query `select case when true then 1 else '1' end from dual`, an exception is thrown:
      `java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType`
      
      I checked the code; the main cause is that HiveTypeCoercion does not do an implicit conversion when an IntegerType and a StringType meet.
      
      Numeric types can be promoted to string type, and Hive always does this implicit conversion.
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #6551 from OopsOutOfMemory/pnts and squashes the following commits:
      
      7a209d7 [OopsOutOfMemory] rebase master
      6018613 [OopsOutOfMemory] convert function to method
      4cd5618 [OopsOutOfMemory] limit the data type to primitive type
      df365d2 [OopsOutOfMemory] refine
      95cbd58 [OopsOutOfMemory] fix style
      403809c [OopsOutOfMemory] promote non-string to string when can not found tighestCommonTypeOfTwo
      98ee3512
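      A toy model of the rule this PR adds (hypothetical names and a greatly simplified type lattice; the real logic lives in HiveTypeCoercion): when no tightest common type exists between two primitive types, promote to string, mirroring Hive.

```python
NUMERIC = ("int", "long", "double")

def tightest_common_type(a, b):
    # Greatly simplified: identical types match; two numerics widen.
    if a == b:
        return a
    if a in NUMERIC and b in NUMERIC:
        return "double"
    return None

def common_type(a, b):
    t = tightest_common_type(a, b)
    # When no tightest common type exists, promote to string, as Hive
    # does for expressions like coalesce(1, '1').
    return t if t is not None else "string"

print(common_type("int", "string"), common_type("int", "long"))  # string double
```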
    • Imran Rashid's avatar
      [SPARK-6782] add sbt-revolver plugin · a4659443
      Imran Rashid authored
      to make it easier to start & stop http servers in sbt
      https://issues.apache.org/jira/browse/SPARK-6782
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #5426 from squito/SPARK-6782 and squashes the following commits:
      
      dc4fb19 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
      a4659443
    • Sean Owen's avatar
      [SPARK-8395] [DOCS] start-slave.sh docs incorrect · f005be02
      Sean Owen authored
      start-slave.sh no longer takes a worker # param in 1.4+
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6855 from srowen/SPARK-8395 and squashes the following commits:
      
      300278e [Sean Owen] start-slave.sh no longer takes a worker # param in 1.4+
      f005be02
    • Michael Davies's avatar
      [SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of children · 0c1b2df0
      Michael Davies authored
      For example large IN clauses
      
      Large IN clauses are parsed very slowly. For example, the SQL below (10K items in the IN clause) takes 45-50s:
      
      s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"""
      
      This is principally due to TreeNode, which repeatedly calls `contains` on `children`, where `children` in this case is a List 10K entries long. In effect, parsing large IN clauses is O(N^2).
      A lazily initialised Set built from `children` for `contains` checks reduces the parse time to around 2.5s.
      
      Author: Michael Davies <Michael.BellDavies@gmail.com>
      
      Closes #6673 from MickDavies/SPARK-8077 and squashes the following commits:
      
      38cd425 [Michael Davies] SPARK-8077: Optimization for  TreeNodes with large numbers of children
      d80103b [Michael Davies] SPARK-8077: Optimization for  TreeNodes with large numbers of children
      e6be8be [Michael Davies] SPARK-8077: Optimization for  TreeNodes with large numbers of children
      0c1b2df0
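      The complexity argument is easy to demonstrate: repeated `contains` over a 10K-element list is O(N^2) overall, while a lazily built set makes each check O(1) after a one-time O(N) build (a Python sketch of the idea, not the actual TreeNode code):

```python
class Node:
    """Hypothetical sketch of a tree node; names are illustrative."""

    def __init__(self, children):
        self.children = children
        self._child_set = None  # built lazily, on first contains check

    def contains_child(self, x):
        # The first call pays O(N) to build the set; every later call is
        # O(1), versus O(N) per call when scanning the children list.
        if self._child_set is None:
            self._child_set = set(self.children)
        return x in self._child_set

node = Node(["n" + str(i) for i in range(10000)])
hits = sum(node.contains_child("n" + str(i)) for i in range(10000))
print(hits)  # 10000
```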
    • Brennon York's avatar
      [SPARK-7017] [BUILD] [PROJECT INFRA] Refactor dev/run-tests into Python · 50a0496a
      Brennon York authored
      All, this is a first attempt at refactoring `dev/run-tests` into Python. Initially I merely converted all Bash calls over to Python, then moved to a much more modular approach (more functions, moved the calls around, etc.). What is here is the initial culmination and should provide a great base to various downstream issues (e.g. SPARK-7016, modularize / parallelize testing, etc.). Would love comments / suggestions for this initial first step!
      
      /cc srowen pwendell nchammas
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5694 from brennonyork/SPARK-7017 and squashes the following commits:
      
      154ed73 [Brennon York] updated finding java binary if JAVA_HOME not set
      3922a85 [Brennon York] removed necessary passed in variable
      f9fbe54 [Brennon York] reverted doc test change
      8135518 [Brennon York] removed the test check for documentation changes until jenkins can get updated
      05d435b [Brennon York] added check for jekyll install
      22edb78 [Brennon York] add check if jekyll isn't installed on the path
      2dff136 [Brennon York] fixed pep8 whitespace errors
      767a668 [Brennon York] fixed path joining issues, ensured docs actually build on doc changes
      c42cf9a [Brennon York] unpack set operations with splat (*)
      fb85a41 [Brennon York] fixed minor set bug
      0379833 [Brennon York] minor doc addition to print the changed modules
      aa03d9e [Brennon York] added documentation builds as a top level test component, altered high level project changes to properly execute core tests only when necessary, changed variable names for simplicity
      ec1ae78 [Brennon York] minor name changes, bug fixes
      b7c72b9 [Brennon York] reverting streaming context
      03fdd7b [Brennon York] fixed the tuple () wraps around example lambda
      705d12e [Brennon York] changed example to comply with pep3113 supporting python3
      60b3d51 [Brennon York] prepend rather than append onto PATH
      7d2f5e2 [Brennon York] updated python tests to remove unused variable
      2898717 [Brennon York] added a change to streaming test to check if it only runs streaming tests
      eb684b6 [Brennon York] fixed sbt_test_goals reference error
      db7ae6f [Brennon York] reverted SPARK_HOME from start of command
      1ecca26 [Brennon York] fixed merge conflicts
2fcdfc0 [Brennon York] testing target branch dump on jenkins
      1f607b1 [Brennon York] finalizing revisions to modular tests
      8afbe93 [Brennon York] made error codes a global
      0629de8 [Brennon York] updated to refactor and remove various small bugs, removed pep8 complaints
      d90ab2d [Brennon York] fixed merge conflicts, ensured that for regular builds both core and sql tests always run
      b1248dc [Brennon York] exec python rather than running python and exiting with return code
      f9deba1 [Brennon York] python to python2 and removed newline
      6d0a052 [Brennon York] incorporated merge conflicts with SPARK-7249
      f950010 [Brennon York] removed building hive-0.12.0 per SPARK-6908
      703f095 [Brennon York] fixed merge conflicts
      b1ca593 [Brennon York] reverted the sparkR test
      afeb093 [Brennon York] updated to make sparkR test fail
      1dada6b [Brennon York] reverted pyspark test failure
      9a592ec [Brennon York] reverted mima exclude issue, added pyspark test failure
      d825aa4 [Brennon York] revert build break, add mima break
      f041d8a [Brennon York] added space from commented import to now test build breaking
      983f2a2 [Brennon York] comment out import to fail build test
      2386785 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-7017
      76335fb [Brennon York] reverted rat license issue for sparkconf
      e4a96cc [Brennon York] removed the import error and added license error, fixed the way run-tests and run-tests.py report their error codes
      56d3cb9 [Brennon York] changed test back and commented out import to break compile
      b37328c [Brennon York] fixed typo and added default return is no error block was found in the environment
      7613558 [Brennon York] updated to return the proper env variable for return codes
      a5bd445 [Brennon York] reverted license, changed test in shuffle to fail
      803143a [Brennon York] removed license file for SparkContext
      b0b2604 [Brennon York] comment out import to see if build fails and returns properly
      83e80ef [Brennon York] attempt at better python output when called from bash
      c095fa6 [Brennon York] removed another wait() call
      26e18e8 [Brennon York] removed unnecessary wait()
      07210a9 [Brennon York] minor doc string change for java version with namedtuple update
      ec03bf3 [Brennon York] added namedtuple for java version to add readability
      2cb413b [Brennon York] upcased global variables, changes various calling methods from check_output to check_call
      639f1e9 [Brennon York] updated with pep8 rules, fixed minor bugs, added run-tests file in bash to call the run-tests.py script
      3c53a1a [Brennon York] uncomment the scala tests :)
      6126c4f [Brennon York] refactored run-tests into python
      50a0496a
    • MechCoder's avatar
      [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySpark · 6765ef98
      MechCoder authored
MatrixUDT was recently implemented in Scala. This ports it to PySpark.
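The shape of a UDT port like this can be sketched as a round trip between a matrix object and a plain tuple suitable for DataFrame storage. This is an illustrative sketch only; the function names and tuple layout are assumptions, not the actual pyspark MatrixUDT code.

```python
# Hypothetical sketch of a UDT-style round trip: flatten a dense matrix
# into a plain (rows, cols, values) tuple on write, rebuild it on read.
# Names and layout are illustrative, not pyspark's MatrixUDT.

def serialize(n_rows, n_cols, values):
    # values is a flat list of length n_rows * n_cols
    assert len(values) == n_rows * n_cols
    return (n_rows, n_cols, list(values))

def deserialize(datum):
    n_rows, n_cols, values = datum
    return (n_rows, n_cols, list(values))

datum = serialize(2, 2, [1.0, 3.0, 2.0, 4.0])
assert deserialize(datum) == (2, 2, [1.0, 3.0, 2.0, 4.0])
```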
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6354 from MechCoder/spark-6390 and squashes the following commits:
      
      fc4dc1e [MechCoder] Better error message
      c940a44 [MechCoder] Added test
      aa9c391 [MechCoder] Add pyUDT to MatrixUDT
      62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
      6765ef98
    • Liang-Chi Hsieh's avatar
      [SPARK-7199] [SQL] Add date and timestamp support to UnsafeRow · 104f30c3
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7199
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5984 from viirya/add_date_timestamp and squashes the following commits:
      
      7f21ce9 [Liang-Chi Hsieh] For comment.
      0b89698 [Liang-Chi Hsieh] Add timestamp to settableFieldTypes.
      c30d490 [Liang-Chi Hsieh] Use default IntUnsafeColumnWriter and LongUnsafeColumnWriter.
      672ef17 [Liang-Chi Hsieh] Remove getter/setter for Date and Timestamp and use Int and Long for them.
      9f3e577 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
      281e844 [Liang-Chi Hsieh] Fix scala style.
      fb532b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
      80af342 [Liang-Chi Hsieh] Fix compiling error.
      f4f5de6 [Liang-Chi Hsieh] Fix scala style.
      a463e83 [Liang-Chi Hsieh] Use Long to store timestamp for rows.
      635388a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
      46946c6 [Liang-Chi Hsieh] Adapt for moved DateUtils.
      b16994e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
      752251f [Liang-Chi Hsieh] Support setDate. Fix failed test.
      fcf8db9 [Liang-Chi Hsieh] Add functions for Date and Timestamp to SpecificRow.
      e42a809 [Liang-Chi Hsieh] Fix style.
      4c07b57 [Liang-Chi Hsieh] Add date and timestamp support to UnsafeRow.
      104f30c3
    • Vyacheslav Baranov's avatar
      [SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap · c13da20a
      Vyacheslav Baranov authored
The problem occurs because the position mask `0xEFFFFFF` is incorrect: its 25th bit is zero, so when capacity grows beyond 2^24, `OpenHashMap` computes an incorrect index into the `_values` array.
      
      I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
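The effect of the zero bit in the mask can be shown directly. The sketch below demonstrates the failure mode; the corrected mask value shown is an assumption based on the description above, illustrating any mask with the high position bits all set.

```python
# Demonstrates the SPARK-8309 failure mode: 0xEFFFFFF has a zero 25th bit
# (bit 24), so any position at or above 2**24 loses that bit when masked,
# producing a wrong slot index. A mask with all position bits set fixes it.

BROKEN_MASK = 0xEFFFFFF    # binary 1110 1111...1111: bit 24 is 0
FIXED_MASK = 0x1FFFFFFF    # all low 29 bits set (assumed corrected value)

pos = 2**24 + 5            # a position only reachable once capacity > 2^24
assert pos & BROKEN_MASK == 5      # bit 24 silently dropped: wrong index
assert pos & FIXED_MASK == pos     # full position preserved
```

This is why the bug only surfaced past roughly 12M items: with the default load factor, the table's capacity first exceeds 2^24 around that size.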
      
      Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
      
      Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:
      
      8557445 [Vyacheslav Baranov] Resolved review comments
      4d5b954 [Vyacheslav Baranov] Resolved review comments
      eaf1e68 [Vyacheslav Baranov] Fixed failing test
      f9284fd [Vyacheslav Baranov] Resolved review comments
      3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap
      c13da20a