  1. Jun 29, 2015
    • [SPARK-8709] Exclude hadoop-client's mockito-all dependency · 27ef8545
      Josh Rosen authored
      This patch excludes `hadoop-client`'s dependency on `mockito-all`.  As of #7061, Spark depends on `mockito-core` instead of `mockito-all`, so the dependency from Hadoop was leading to test compilation failures for some of the Hadoop 2 SBT builds.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7090 from JoshRosen/SPARK-8709 and squashes the following commits:
      
      e190122 [Josh Rosen] [SPARK-8709] Exclude hadoop-client's mockito-all dependency.
      27ef8545
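For reference, a Maven dependency exclusion of this kind looks roughly like the following POM fragment (a sketch, not the exact diff from the patch; the `hadoop.version` property is illustrative):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <!-- Spark now depends on mockito-core, so drop Hadoop's mockito-all -->
    <exclusion>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-all</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```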
    • [SPARK-8070] [SQL] [PYSPARK] avoid spark jobs in createDataFrame · afae9766
      Davies Liu authored
      Avoid the unnecessary Spark jobs when inferring the schema from a list.
      
      cc yhuai mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6606 from davies/improve_create and squashes the following commits:
      
      a5928bf [Davies Liu] Update MimaExcludes.scala
      62da911 [Davies Liu] fix mima
      bab4d7d [Davies Liu] Merge branch 'improve_create' of github.com:davies/spark into improve_create
      eee44a8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
      8d9292d [Davies Liu] Update context.py
      eb24531 [Davies Liu] Update context.py
      c969997 [Davies Liu] bug fix
      d5a8ab0 [Davies Liu] fix tests
      8c3f10d [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
      6ea5925 [Davies Liu] address comments
      6ceaeff [Davies Liu] avoid spark jobs in createDataFrame
      afae9766
    • [SPARK-8681] fixed wrong ordering of columns in crosstab · be7ef067
      Burak Yavuz authored
      I specifically randomized the test. What crosstab does is equivalent to a countByKey, so if this test fails again for any reason, we will know that we hit a corner case.
      
      cc rxin marmbrus
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #7060 from brkyvz/crosstab-fixes and squashes the following commits:
      
      0a65234 [Burak Yavuz] addressed comments v1
      d96da7e [Burak Yavuz] fixed wrong ordering of columns in crosstab
      be7ef067
    • [SPARK-7862] [SQL] Disable the error message redirect to stderr · c6ba2ea3
      Cheng Hao authored
      This is a follow-up of #6404: ScriptTransformation prints the error message directly to stderr, which could be a disaster for the application log.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6882 from chenghao-intel/verbose and squashes the following commits:
      
      bfedd77 [Cheng Hao] revert the write
      76ff46b [Cheng Hao] update the CircularBuffer
      692b19e [Cheng Hao] check the process exitValue for ScriptTransform
      47e0970 [Cheng Hao] Use the RedirectThread instead
      1de771d [Cheng Hao] naming the threads in ScriptTransformation
      8536e81 [Cheng Hao] disable the error message redirection for stderr
      c6ba2ea3
    • [SPARK-8214] [SQL] Add function hex · 637b4eed
      zhichao.li authored
      cc chenghao-intel  adrian-wang
      
      Author: zhichao.li <zhichao.li@intel.com>
      
      Closes #6976 from zhichao-li/hex and squashes the following commits:
      
      e218d1b [zhichao.li] turn off scalastyle for non-ascii
      de3f5ea [zhichao.li] non-ascii char
      cf9c936 [zhichao.li] give separated buffer for each hex method
      967ec90 [zhichao.li] Make 'value' as a feild of Hex
      3b2fa13 [zhichao.li] tiny fix
      a647641 [zhichao.li] remove duplicate null check
      7cab020 [zhichao.li] tiny refactoring
      35ecfe5 [zhichao.li] add function hex
      637b4eed
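As a rough, non-authoritative sketch of what the new `hex` function computes (not Spark's actual implementation; the uppercase output format is an assumption): an integral value is rendered as its hexadecimal string, and a string/binary value as the hex of its bytes.

```python
def hex_long(n: int) -> str:
    """Hexadecimal rendering of a non-negative integer, e.g. 17 -> '11'."""
    return format(n, "X")

def hex_bytes(b: bytes) -> str:
    """Hexadecimal rendering of raw bytes, e.g. b'Spark' -> '537061726B'."""
    return b.hex().upper()

print(hex_long(17))         # 11
print(hex_bytes(b"Spark"))  # 537061726B
```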
    • [SQL][DOCS] Remove wrong example from DataFrame.scala · 94e040d0
      Kousuke Saruta authored
      In DataFrame.scala, there are examples like the following.
      
      ```
       * // The following are equivalent:
       * peopleDf.filter($"age" > 15)
       * peopleDf.where($"age" > 15)
       * peopleDf($"age" > 15)
      ```
      
      But I think the last example doesn't work.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #6977 from sarutak/fix-dataframe-example and squashes the following commits:
      
      46efbd7 [Kousuke Saruta] Removed wrong example
      94e040d0
    • [SPARK-8528] Expose SparkContext.applicationId in PySpark · 492dca3a
      Vladimir Vladimirov authored
      Use case: we want to log the applicationId (YARN in our case) when requesting troubleshooting help from DevOps.
      
      Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
      
      Closes #6936 from smartkiwi/master and squashes the following commits:
      
      870338b [Vladimir Vladimirov] this would make doctest to run in python3
      0eae619 [Vladimir Vladimirov] Scala doesn't use u'...' for unicode literals
      14d77a8 [Vladimir Vladimirov] stop using ELLIPSIS
      b4ebfc5 [Vladimir Vladimirov] addressed PR feedback - updated docstring
      223a32f [Vladimir Vladimirov] fixed test - applicationId is property that returns the string
      3221f5a [Vladimir Vladimirov] [SPARK-8528] added documentation for Scala
      2cff090 [Vladimir Vladimirov] [SPARK-8528] add applicationId property for SparkContext object in pyspark
      492dca3a
    • [SPARK-8235] [SQL] misc function sha / sha1 · a5c2961c
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-8235
      
      I added the support for sha1. If I understood rxin correctly, sha and sha1 should execute the same algorithm, shouldn't they?
      
      Please take a close look on the Python part. This is adopted from #6934
      
      Author: Tarek Auel <tarek.auel@gmail.com>
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #6963 from tarekauel/SPARK-8235 and squashes the following commits:
      
      f064563 [Tarek Auel] change to shaHex
      7ce3cdc [Tarek Auel] rely on automatic cast
      a1251d6 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-8235
      68eb043 [Tarek Auel] added docstring
      be5aff1 [Tarek Auel] improved error message
      7336c96 [Tarek Auel] added type check
      cf23a80 [Tarek Auel] simplified example
      ebf75ef [Tarek Auel] [SPARK-8301] updated the python documentation. Removed sha in python and scala
      6d6ff0d [Tarek Auel] [SPARK-8233] added docstring
      ea191a9 [Tarek Auel] [SPARK-8233] fixed signatureof python function. Added expected type to misc
      e3fd7c3 [Tarek Auel] SPARK[8235] added sha to the list of __all__
      e5dad4e [Tarek Auel] SPARK[8235] sha / sha1
      a5c2961c
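For reference, SHA-1 is available in Python's standard library; a minimal sketch of what a `sha1(string)` expression computes (the hex digest of the UTF-8 bytes; this is an illustration of the algorithm, not Spark's code):

```python
import hashlib

def sha1_hex(s: str) -> str:
    """Lowercase hex SHA-1 digest of a string's UTF-8 bytes."""
    return hashlib.sha1(s.encode("utf-8")).hexdigest()

# Well-known SHA-1 test vector:
print(sha1_hex("abc"))  # a9993e364706816aba3e25717850c26c9cd0d89d
```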
    • [SPARK-8066, SPARK-8067] [hive] Add support for Hive 1.0, 1.1 and 1.2. · 3664ee25
      Marcelo Vanzin authored
      Allow HiveContext to connect to metastores of those versions; some new shims
      had to be added to account for changing internal APIs.
      
      A new test was added to exercise the "reset()" path which now also requires
      a shim; and the test code was changed to use a directory under the build's
      target to store ivy dependencies. Without that, at least I consistently run
      into issues with Ivy messing up (or being confused) by my existing caches.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7026 from vanzin/SPARK-8067 and squashes the following commits:
      
      3e2e67b [Marcelo Vanzin] [SPARK-8066, SPARK-8067] [hive] Add support for Hive 1.0, 1.1 and 1.2.
      3664ee25
    • [SPARK-8692] [SQL] re-order the case statements that handling catalyst data types · ed413bcc
      Wenchen Fan authored
      Use the same order: boolean, byte, short, int, date, long, timestamp, float, double, string, binary, decimal.
      
      Then we can easily check at a glance whether any data types are missing, and make sure we handle date/timestamp just like int/long.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7073 from cloud-fan/fix-date and squashes the following commits:
      
      463044d [Wenchen Fan] fix style
      51cd347 [Wenchen Fan] refactor handling of date and timestmap
      ed413bcc
    • Andrew Or
    • [SPARK-8554] Add the SparkR document files to `.rat-excludes` for `./dev/check-license` · 715f084c
      Yu ISHIKAWA authored
      [[SPARK-8554] Add the SparkR document files to `.rat-excludes` for `./dev/check-license` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8554)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6947 from yu-iskw/SPARK-8554 and squashes the following commits:
      
      5ca240c [Yu ISHIKAWA] [SPARK-8554] Add the SparkR document files to `.rat-excludes` for `./dev/check-license`
      715f084c
    • [SPARK-8693] [PROJECT INFRA] profiles and goals are not printed in a nice way · 5c796d57
      Brennon York authored
      Hotfix to correct formatting errors in print statements within the dev and Jenkins builds. The error looks like:
      
      ```
      -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  streaming-kafka-assembly/assembly
      ```
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #7085 from brennonyork/SPARK-8693 and squashes the following commits:
      
      c5575f1 [Brennon York] added commas to end of print statements for proper printing
      5c796d57
    • [SPARK-8702] [WEBUI] Avoid massive concating strings in Javascript · 630bd5fd
      zsxwing authored
      When there are massive numbers of tasks, such as `sc.parallelize(1 to 100000, 10000).count()`, the generated JS code on the stage page contains a lot of string concatenations, nearly 40 per task.
      
      We can generate the whole string for a task instead of executing string concatenations in the browser.
      
      Before this patch, the load time of the page is about 21 seconds.
      ![screen shot 2015-06-29 at 6 44 04 pm](https://cloud.githubusercontent.com/assets/1000778/8406644/eb55ed18-1e90-11e5-9ad5-50d27ad1dff1.png)
      
      After this patch, it reduces to about 17 seconds.
      
      ![screen shot 2015-06-29 at 6 47 34 pm](https://cloud.githubusercontent.com/assets/1000778/8406665/087003ca-1e91-11e5-80a8-3485aa9adafa.png)
      
      One disadvantage is that the generated JS code becomes hard to read.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7082 from zsxwing/js-string and squashes the following commits:
      
      b29231d [zsxwing] Avoid massive concating strings in Javascript
      630bd5fd
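The idea, language aside, is to assemble each task's markup as one string on the server rather than emitting many small `+` concatenations for the browser to execute. A hedged sketch of that approach (field names and markup are illustrative, not the actual stage-page code):

```python
def render_task_row(task: dict) -> str:
    # One join instead of ~40 incremental concatenations per task.
    cells = [str(task["index"]), task["status"], f"{task['duration']} ms"]
    return "<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>"

print(render_task_row({"index": 0, "status": "SUCCESS", "duration": 12}))
```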
    • [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should... · 660c6cec
      Reynold Xin authored
      [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7079 from rxin/SPARK-8698 and squashes the following commits:
      
      8513e1c [Reynold Xin] [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
      660c6cec
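The pitfall being fixed is an ambiguous default: defaulting `partitionBy` to an empty tuple makes "no partitioning requested" indistinguishable from "partition by zero columns". A sketch of the safer `None` convention (illustrative code, not the actual PySpark writer):

```python
def resolve_partition_columns(partitionBy=None):
    """Return the partition columns a save call would apply (sketch only)."""
    if partitionBy is None:
        return []             # caller did not ask for partitioning
    return list(partitionBy)  # explicit, possibly empty, column list

print(resolve_partition_columns())            # []
print(resolve_partition_columns(("year",)))   # ['year']
```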
    • [SPARK-8355] [SQL] Python DataFrameReader/Writer should mirror Scala · ac2e17b0
      Cheolsoo Park authored
      I compared the PySpark DataFrameReader/Writer against the Scala ones. The `option` function is missing from both reader and writer, but the rest all seems to match.
      
      I added `option` to the reader and writer and updated the `pyspark-sql` test.
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7078 from piaozhexiu/SPARK-8355 and squashes the following commits:
      
      c63d419 [Cheolsoo Park] Fix version
      524e0aa [Cheolsoo Park] Add option function to df reader and writer
      ac2e17b0
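`option` follows the chainable builder pattern used by DataFrameReader/Writer: store the key/value pair and return `self` so calls can be chained. A minimal sketch of that pattern (not the actual PySpark code):

```python
class ReaderSketch:
    """Illustrates the chainable option() builder pattern."""

    def __init__(self):
        self._options = {}

    def option(self, key: str, value) -> "ReaderSketch":
        # Record the option and return self to allow chaining.
        self._options[key] = str(value)
        return self

reader = ReaderSketch().option("header", "true").option("inferSchema", "true")
print(reader._options)
```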
    • [SPARK-8575] [SQL] Deprecate callUDF in favor of udf · 0b10662f
      BenFradet authored
      Follow up of [SPARK-8356](https://issues.apache.org/jira/browse/SPARK-8356) and #6902.
      Removes the unit test for the now-deprecated ```callUdf```.
      The unit test in SQLQuerySuite now uses ```udf``` instead of ```callUDF```.
      Replaced ```callUDF``` with ```udf``` where possible in mllib.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #6993 from BenFradet/SPARK-8575 and squashes the following commits:
      
      26f5a7a [BenFradet] 2 spaces instead of 1
      1ddb452 [BenFradet] renamed initUDF in order to be consistent in OneVsRest
      48ca15e [BenFradet] used vector type tag for udf call in VectorIndexer
      0ebd0da [BenFradet] replace the now deprecated callUDF by udf in VectorIndexer
      8013409 [BenFradet] replaced the now deprecated callUDF by udf in Predictor
      94345b5 [BenFradet] unifomized udf calls in ProbabilisticClassifier
      1305492 [BenFradet] uniformized udf calls in Classifier
      a672228 [BenFradet] uniformized udf calls in OneVsRest
      49e4904 [BenFradet] Revert "removal of the unit test for the now deprecated callUdf"
      bbdeaf3 [BenFradet] fixed syntax for init udf in OneVsRest
      fe2a10b [BenFradet] callUDF => udf in ProbabilisticClassifier
      0ea30b3 [BenFradet] callUDF => udf in Classifier where possible
      197ec82 [BenFradet] callUDF => udf in OneVsRest
      84d6780 [BenFradet] modified unit test in SQLQuerySuite to use udf instead of callUDF
      477709f [BenFradet] removal of the unit test for the now deprecated callUdf
      0b10662f
    • [SPARK-5962] [MLLIB] Python support for Power Iteration Clustering · dfde31da
      Yanbo Liang authored
      Python support for Power Iteration Clustering
      https://issues.apache.org/jira/browse/SPARK-5962
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6992 from yanboliang/pyspark-pic and squashes the following commits:
      
      6b03d82 [Yanbo Liang] address comments
      4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
      dfde31da
    • [SPARK-7212] [MLLIB] Add sequence learning flag · 25f574eb
      Feynman Liang authored
      Support mining of ordered frequent item sequences.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #6997 from feynmanliang/fp-sequence and squashes the following commits:
      
      7c14e15 [Feynman Liang] Improve scalatests with R code and Seq
      0d3e4b6 [Feynman Liang] Fix python test
      ce987cb [Feynman Liang] Backwards compatibility aux constructor
      34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering
      f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq
      648d4d4 [Feynman Liang] Test case for frequent item sequences
      252a36a [Feynman Liang] Add sequence learning flag
      25f574eb
  2. Jun 28, 2015
    • [SPARK-7845] [BUILD] Bumping default Hadoop version used in profile hadoop-1 to 1.2.1 · 00a9d22b
      Cheng Lian authored
      PR #5694 reverted PR #6384 while refactoring `dev/run-tests` into `dev/run-tests.py`. Also, PR #6384 didn't bump the Hadoop 1 version defined in the POM.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7062 from liancheng/spark-7845 and squashes the following commits:
      
      c088b72 [Cheng Lian] Bumping default Hadoop version used in profile hadoop-1 to 1.2.1
      00a9d22b
    • [SPARK-8677] [SQL] Fix non-terminating decimal expansion for decimal divide operation · 24fda738
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8677
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #7056 from viirya/fix_decimal3 and squashes the following commits:
      
      34d7419 [Liang-Chi Hsieh] Fix Non-terminating decimal expansion for decimal divide operation.
      24fda738
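The underlying issue: Java's `BigDecimal.divide` throws `ArithmeticException: Non-terminating decimal expansion` for quotients like 1/3 unless a `MathContext` (precision and rounding mode) is supplied. Python's `decimal` module illustrates the rounded-context behavior:

```python
from decimal import Decimal, localcontext

# With an explicit precision, the non-terminating quotient 1/3 is rounded
# to the context precision instead of raising.
with localcontext() as ctx:
    ctx.prec = 10
    q = Decimal(1) / Decimal(3)

print(q)  # 0.3333333333
```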
    • [SPARK-8596] [EC2] Added port for Rstudio · 9ce78b43
      Vincent D. Warmerdam authored
      This would otherwise need to be set manually by R users in AWS.
      
      https://issues.apache.org/jira/browse/SPARK-8596
      
      Author: Vincent D. Warmerdam <vincentwarmerdam@gmail.com>
      Author: vincent <vincentwarmerdam@gmail.com>
      
      Closes #7068 from koaning/rstudio-port-number and squashes the following commits:
      
      ac8100d [vincent] Update spark_ec2.py
      ce6ad88 [Vincent D. Warmerdam] added port number for rstudio
      9ce78b43
    • [SPARK-8686] [SQL] DataFrame should support `where` with expression represented by String · ec784381
      Kousuke Saruta authored
      DataFrame supports the `filter` function with two types of argument, `Column` and `String`, but `where` doesn't.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7063 from sarutak/SPARK-8686 and squashes the following commits:
      
      180f9a4 [Kousuke Saruta] Added test
      d61aec4 [Kousuke Saruta] Add "where" method with String argument to DataFrame
      ec784381
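Conceptually, the change just mirrors `filter`'s overloads onto `where`, with `where` delegating to `filter`. A toy sketch of that delegation (the one-expression "parser" below is a deliberate simplification for illustration):

```python
class DataFrameSketch:
    def __init__(self, rows):
        self._rows = rows

    def filter(self, condition):
        # Accept either a callable (Column-like) or a string expression.
        if callable(condition):
            pred = condition
        elif condition == "age > 15":  # toy parser: one hard-coded expression
            pred = lambda row: row["age"] > 15
        else:
            raise ValueError(f"unsupported expression: {condition}")
        return DataFrameSketch([r for r in self._rows if pred(r)])

    def where(self, condition):
        # Same overloads as filter, by delegation.
        return self.filter(condition)

df = DataFrameSketch([{"age": 10}, {"age": 20}])
print(len(df.where("age > 15")._rows))  # 1
```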
    • [SPARK-8610] [SQL] Separate Row and InternalRow (part 2) · 77da5be6
      Davies Liu authored
      Currently, we use GenericRow for both Row and InternalRow, which is confusing because it could contain Scala types as well as Catalyst types.
      
      This PR changes it to use GenericInternalRow for InternalRow (which contains Catalyst types) and GenericRow for Row (which contains Scala types).
      
      Also fixes some incorrect use of InternalRow or Row.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7003 from davies/internalrow and squashes the following commits:
      
      d05866c [Davies Liu] fix test: rollback changes for pyspark
      72878dd [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
      efd0b25 [Davies Liu] fix copy of MutableRow
      87b13cf [Davies Liu] fix test
      d2ebd72 [Davies Liu] fix style
      eb4b473 [Davies Liu] mark expensive API as final
      bd4e99c [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
      bdfb78f [Davies Liu] remove BaseMutableRow
      6f99a97 [Davies Liu] fix catalyst test
      defe931 [Davies Liu] remove BaseRow
      288b31f [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
      9d24350 [Davies Liu] separate Row and InternalRow (part 2)
      77da5be6
    • [SPARK-8649] [BUILD] Mapr repository is not defined properly · 52d12818
      Thomas Szymanski authored
      The previous committer on this part was pwendell.
      
      The previous URL gives a 404; the new one seems to be OK.
      
      This patch is added under the Apache License 2.0.
      
      The JIRA link: https://issues.apache.org/jira/browse/SPARK-8649
      
      Author: Thomas Szymanski <develop@tszymanski.com>
      
      Closes #7054 from tszym/SPARK-8649 and squashes the following commits:
      
      bfda9c4 [Thomas Szymanski] [SPARK-8649] [BUILD] Mapr repository is not defined properly
      52d12818
    • [SPARK-8683] [BUILD] Depend on mockito-core instead of mockito-all · f5100451
      Josh Rosen authored
      Spark's tests currently depend on `mockito-all`, which bundles Hamcrest and Objenesis classes. Instead, it should depend on `mockito-core`, which declares those libraries as Maven dependencies. This is necessary in order to fix a dependency conflict that leads to a NoSuchMethodError when using certain Hamcrest matchers.
      
      See https://github.com/mockito/mockito/wiki/Declaring-mockito-dependency for more details.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7061 from JoshRosen/mockito-core-instead-of-all and squashes the following commits:
      
      70eccbe [Josh Rosen] Depend on mockito-core instead of mockito-all.
      f5100451
    • Josh Rosen · 42db3a1c
  3. Jun 27, 2015
    • [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with... · 40648c56
      Josh Rosen authored
      [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with dev/run-tests module system
      
      This patch refactors the `python/run-tests` script:
      
      - It's now written in Python instead of Bash.
      - The descriptions of the tests to run are now stored in `dev/run-tests`'s modules.  This allows the pull request builder to skip Python tests suites that were not affected by the pull request's changes.  For example, we can now skip the PySpark Streaming test cases when only SQL files are changed.
      - `python/run-tests` now supports command-line flags to make it easier to run individual test suites (this addresses SPARK-5482):
      
        ```
      Usage: run-tests [options]
      
      Options:
        -h, --help            show this help message and exit
        --python-executables=PYTHON_EXECUTABLES
                              A comma-separated list of Python executables to test
                              against (default: python2.6,python3.4,pypy)
        --modules=MODULES     A comma-separated list of Python modules to test
                              (default: pyspark-core,pyspark-ml,pyspark-mllib
                              ,pyspark-sql,pyspark-streaming)
         ```
      - `dev/run-tests` has been split into multiple files: the module definitions and test utility functions are now stored inside of a `dev/sparktestsupport` Python module, allowing them to be re-used from the Python test runner script.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6967 from JoshRosen/run-tests-python-modules and squashes the following commits:
      
      f578d6d [Josh Rosen] Fix print for Python 2.x
      8233d61 [Josh Rosen] Add python/run-tests.py to Python lint checks
      34c98d2 [Josh Rosen] Fix universal_newlines for Python 3
      8f65ed0 [Josh Rosen] Fix handling of  module in python/run-tests
      37aff00 [Josh Rosen] Python 3 fix
      27a389f [Josh Rosen] Skip MLLib tests for PyPy
      c364ccf [Josh Rosen] Use which() to convert PYSPARK_PYTHON to an absolute path before shelling out to run tests
      568a3fd [Josh Rosen] Fix hashbang
      3b852ae [Josh Rosen] Fall back to PYSPARK_PYTHON when sys.executable is None (fixes a test)
      f53db55 [Josh Rosen] Remove python2 flag, since the test runner script also works fine under Python 3
      9c80469 [Josh Rosen] Fix passing of PYSPARK_PYTHON
      d33e525 [Josh Rosen] Merge remote-tracking branch 'origin/master' into run-tests-python-modules
      4f8902c [Josh Rosen] Python lint fixes.
      8f3244c [Josh Rosen] Use universal_newlines to fix dev/run-tests doctest failures on Python 3.
      f542ac5 [Josh Rosen] Fix lint check for Python 3
      fff4d09 [Josh Rosen] Add dev/sparktestsupport to pep8 checks
      2efd594 [Josh Rosen] Update dev/run-tests to use new Python test runner flags
      b2ab027 [Josh Rosen] Add command-line options for running individual suites in python/run-tests
      caeb040 [Josh Rosen] Fixes to PySpark test module definitions
      d6a77d3 [Josh Rosen] Fix the tests of dev/run-tests
      def2d8a [Josh Rosen] Two minor fixes
      aec0b8f [Josh Rosen] Actually get the Kafka stuff to run properly
      04015b9 [Josh Rosen] First attempt at getting PySpark Kafka test to work in new runner script
      4c97136 [Josh Rosen] PYTHONPATH fixes
      dcc9c09 [Josh Rosen] Fix time division
      32660fc [Josh Rosen] Initial cut at Python test runner refactoring
      311c6a9 [Josh Rosen] Move shell utility functions to own module.
      1bdeb87 [Josh Rosen] Move module definitions to separate file.
      40648c56
    • [SPARK-8606] Prevent exceptions in RDD.getPreferredLocations() from crashing DAGScheduler · 0b5abbf5
      Josh Rosen authored
      If `RDD.getPreferredLocations()` throws an exception it may crash the DAGScheduler and SparkContext. This patch addresses this by adding a try-catch block.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7023 from JoshRosen/SPARK-8606 and squashes the following commits:
      
      770b169 [Josh Rosen] Fix getPreferredLocations() DAGScheduler crash with try block.
      44a9b55 [Josh Rosen] Add test of a buggy getPartitions() method
      19aa9f7 [Josh Rosen] Add (failing) regression test for getPreferredLocations() DAGScheduler crash
      0b5abbf5
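The defensive pattern can be sketched as follows (hypothetical names; in the actual patch the scheduler catches the exception and fails the job cleanly rather than crashing): user-supplied `getPreferredLocations` is invoked inside a try block so a buggy RDD cannot take down the event loop.

```python
def preferred_locations_or_none(rdd, partition):
    """Call user code defensively so the scheduler survives a buggy RDD."""
    try:
        return rdd.getPreferredLocations(partition)
    except Exception as exc:  # contain the failure instead of crashing the loop
        print(f"getPreferredLocations failed: {exc}")
        return None

class BuggyRDD:
    def getPreferredLocations(self, partition):
        raise RuntimeError("broken user code")

print(preferred_locations_or_none(BuggyRDD(), 0))  # None
```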
    • [SPARK-8623] Hadoop RDDs fail to properly serialize configuration · 4153776f
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #7050 from sryza/sandy-spark-8623 and squashes the following commits:
      
      58a8079 [Sandy Ryza] SPARK-8623. Hadoop RDDs fail to properly serialize configuration
      4153776f
    • [SPARK-3629] [YARN] [DOCS]: Improvement of the "Running Spark on YARN" document · d48e7893
      Neelesh Srinivas Salian authored
      As per the description in the JIRA, I moved the contents of the page and added some additional content.
      
      Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
      
      Closes #6924 from nssalian/SPARK-3629 and squashes the following commits:
      
      944b7a0 [Neelesh Srinivas Salian] Changed the lines about deploy-mode and added backticks to all parameters
      40dbc0b [Neelesh Srinivas Salian] Changed dfs to HDFS, deploy-mode in backticks and updated the master yarn line
      9cbc072 [Neelesh Srinivas Salian] Updated a few lines in the Launching Spark on YARN Section
      8e8db7f [Neelesh Srinivas Salian] Removed the changes in this commit to help clearly distinguish movement from update
      151c298 [Neelesh Srinivas Salian] SPARK-3629: Improvement of the Spark on YARN document
      d48e7893
    • [SPARK-8639] [DOCS] Fixed Minor Typos in Documentation · b5a6663d
      Rosstin authored
      Ticket: [SPARK-8639](https://issues.apache.org/jira/browse/SPARK-8639)
      
      fixed minor typos in docs/README.md and docs/api.md
      
      Author: Rosstin <asterazul@gmail.com>
      
      Closes #7046 from Rosstin/SPARK-8639 and squashes the following commits:
      
      6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
      b5a6663d
  4. Jun 26, 2015
    • [SPARK-8607] SparkR -- jars not being added to application classpath correctly · 9d118177
      cafreeman authored
      Add `getStaticClass` method in SparkR's `RBackendHandler`
      
      This is a fix for the problem referenced in [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185).
      
      cc shivaram
      
      Author: cafreeman <cfreeman@alteryx.com>
      
      Closes #7001 from cafreeman/branch-1.4 and squashes the following commits:
      
      8f81194 [cafreeman] Add missing license
      31aedcf [cafreeman] Refactor test to call an external R script
      2c22073 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
      0bea809 [cafreeman] Fixed relative path issue and added smaller JAR
      ee25e60 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
      9a5c362 [cafreeman] test for including JAR when launching sparkContext
      9101223 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
      5a80844 [cafreeman] Fix style nits
      7c6bd0c [cafreeman] [SPARK-8607] SparkR
      
      (cherry picked from commit 2579948b)
      Signed-off-by: default avatarShivaram Venkataraman <shivaram@cs.berkeley.edu>
      9d118177
    • [SPARK-8662] SparkR Update SparkSQL Test · a56516fc
      cafreeman authored
      Test `infer_type` using a more fine-grained approach rather than comparing environments. Since `all.equal`'s behavior changed in R 3.2, the test became impossible to pass.
      
      JIRA here: https://issues.apache.org/jira/browse/SPARK-8662
      
      Author: cafreeman <cfreeman@alteryx.com>
      
      Closes #7045 from cafreeman/R32_Test and squashes the following commits:
      
      b97cc52 [cafreeman] Add `checkStructField` utility
      3381e5c [cafreeman] Update SparkSQL Test
      
      (cherry picked from commit 78b31a2a)
      Signed-off-by: default avatarShivaram Venkataraman <shivaram@cs.berkeley.edu>
      a56516fc
    • [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() · 41afa165
      Josh Rosen authored
      This patch addresses a critical issue in the PySpark tests:
      
      Several of our Python modules' `__main__` methods call `doctest.testmod()` in order to run doctests but forget to check and handle its return value. As a result, some PySpark test failures can go unnoticed because they will not fail the build.
      
      Fortunately, there was only one test failure which was masked by this bug: a `pyspark.profiler` doctest was failing due to changes in RDD pipelining.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7032 from JoshRosen/testmod-fix and squashes the following commits:
      
      60dbdc0 [Josh Rosen] Account for int vs. long formatting change in Python 3
      8b8d80a [Josh Rosen] Fix failing test.
      e6423f9 [Josh Rosen] Check return code for all uses of doctest.testmod().
      41afa165
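The stdlib contract at issue: `doctest.testmod()` returns a `(failure_count, test_count)` pair and does not itself fail the process, so a `__main__` block must inspect the count and exit non-zero on failure. A self-contained sketch:

```python
import doctest
import types

def run_module_doctests(module) -> int:
    """Run a module's doctests and return the number of failures."""
    failure_count, _test_count = doctest.testmod(module, verbose=False)
    return failure_count

# Demonstrate on a throwaway module with one deliberately failing doctest.
mod = types.ModuleType("throwaway")
exec(
    'def f():\n'
    '    """\n'
    '    >>> f()\n'
    '    2\n'
    '    """\n'
    '    return 1\n',
    mod.__dict__,
)

failures = run_module_doctests(mod)
if failures:
    print(f"{failures} doctest failure(s); a real runner would sys.exit(-1)")
```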
    • [SPARK-8302] Support heterogeneous cluster install paths on YARN. · 37bf76a2
      Marcelo Vanzin authored
      Some users have Hadoop installations on different paths across
      their cluster. Currently, that makes it hard to set up some
      configuration in Spark since that requires hardcoding paths to
      jar files or native libraries, which wouldn't work on such a cluster.
      
      This change introduces a couple of YARN-specific configurations
      that instruct the backend to replace certain paths when launching
      remote processes. That way, if the configuration says the Spark
      jar is in "/spark/spark.jar", and also says that "/spark" should be
      replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers
      in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location
      of the jar.
      
      Coupled with YARN's environment whitelist (which allows certain
      env variables to be exposed to containers), this allows users to
      support such heterogeneous environments, as long as a single
      replacement is enough. (Otherwise, this feature would need to be
      extended to support multiple path replacements.)
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6752 from vanzin/SPARK-8302 and squashes the following commits:
      
      4bff8d4 [Marcelo Vanzin] Add docs, rename configs.
      0aa2a02 [Marcelo Vanzin] Only do replacement for paths that need it.
      2e9cc9d [Marcelo Vanzin] Style.
      a5e1f68 [Marcelo Vanzin] [SPARK-8302] Support heterogeneous cluster install paths on YARN.
      37bf76a2
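The mechanism described above can be sketched as a simple prefix substitution applied to each path before launching the container (the placeholder name follows the `{{SPARK_INSTALL_DIR}}` example in the message; this is an illustration, not the actual YARN backend code):

```python
def substitute_prefix(path: str, replacements: dict) -> str:
    """Replace a configured path prefix with an env-style placeholder."""
    for prefix, placeholder in replacements.items():
        if path == prefix or path.startswith(prefix + "/"):
            return placeholder + path[len(prefix):]
    return path

rules = {"/spark": "{{SPARK_INSTALL_DIR}}"}
print(substitute_prefix("/spark/spark.jar", rules))
# {{SPARK_INSTALL_DIR}}/spark.jar
```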
    • [SPARK-8613] [ML] [TRIVIAL] add param to disable linear feature scaling · c9e05a31
      Holden Karau authored
      Add a param to disable linear feature scaling (to be implemented later in linear & logistic regression). Done as a separate PR so we can use the same param and not conflict while working on the sub-tasks.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7024 from holdenk/SPARK-8522-Disable-Linear_featureScaling-Spark-8613-Add-param and squashes the following commits:
      
      ce8931a [Holden Karau] Regenerate the sharedParams code
      fa6427e [Holden Karau] update text for standardization param.
      7b24a2b [Holden Karau] generate the new standardization param
      3c190af [Holden Karau] Add the standardization param to sharedparamscodegen
      c9e05a31
    • [SPARK-8344] Add message processing time metric to DAGScheduler · 9fed6abf
      Josh Rosen authored
      This commit adds a new metric, `messageProcessingTime`, to the DAGScheduler metrics source. This metrics tracks the time taken to process messages in the scheduler's event processing loop, which is a helpful debugging aid for diagnosing performance issues in the scheduler (such as SPARK-4961).
      
      In order to do this, I moved the creation of the DAGSchedulerSource metrics source into DAGScheduler itself, similar to how MasterSource is created and registered in Master.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7002 from JoshRosen/SPARK-8344 and squashes the following commits:
      
      57f914b [Josh Rosen] Fix import ordering
      7d6bb83 [Josh Rosen] Add message processing time metrics to DAGScheduler
      9fed6abf
    • [SPARK-8635] [SQL] improve performance of CatalystTypeConverters · 1a79f0eb
      Wenchen Fan authored
      In `CatalystTypeConverters.createToCatalystConverter`, we add special handling for primitive types. We can apply this strategy to more places to improve performance.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7018 from cloud-fan/converter and squashes the following commits:
      
      8b16630 [Wenchen Fan] another fix
      326c82c [Wenchen Fan] optimize type converter
      1a79f0eb
    • [SPARK-8620] [SQL] cleanup CodeGenContext · 40360112
      Wenchen Fan authored
      Fix docs, remove nativeTypes, and use the Java type to get the boxed type, default value, etc., to avoid handling `DateType` and `TimestampType` as int and long again and again.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7010 from cloud-fan/cg and squashes the following commits:
      
      aa01cf9 [Wenchen Fan] cleanup CodeGenContext
      40360112