  1. Sep 28, 2014
    • Reynold Xin's avatar
      Minor fix for the previous commit. · 66e1c40c
      Reynold Xin authored
      66e1c40c
    • Dale's avatar
      SPARK-CORE [SPARK-3651] Group common CoarseGrainedSchedulerBackend variables together · 9966d1a8
      Dale authored
      from [SPARK-3651]
      In CoarseGrainedSchedulerBackend, we have:
      
          private val executorActor = new HashMap[String, ActorRef]
          private val executorAddress = new HashMap[String, Address]
          private val executorHost = new HashMap[String, String]
          private val freeCores = new HashMap[String, Int]
          private val totalCores = new HashMap[String, Int]
      
      We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299.
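      The grouping can be sketched in miniature (plain Python rather than Spark's Scala, with illustrative names; the squashed commits indicate the real class ended up as `ExecutorData`):

```python
from dataclasses import dataclass

# Illustrative sketch of the consolidation idea, not Spark's code: the
# parallel maps keyed by executor ID collapse into one map whose values
# group the related fields together.
@dataclass
class ExecutorData:
    host: str
    free_cores: int
    total_cores: int

executor_data_map = {}

def register_executor(executor_id, host, cores):
    # One insertion replaces several parallel puts.
    executor_data_map[executor_id] = ExecutorData(host, cores, cores)

def remove_executor(executor_id):
    # One removal keeps every field consistent.
    return executor_data_map.pop(executor_id, None)

register_executor("exec-1", "worker-a", 4)
print(executor_data_map["exec-1"].total_cores)  # 4
```

      With one map, registration and removal each touch a single entry, so the grouped fields can never drift out of sync.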
      
      Author: Dale <tigerquoll@outlook.com>
      
      Closes #2533 from tigerquoll/SPARK-3651 and squashes the following commits:
      
      d1be0a9 [Dale] [SPARK-3651]  implemented suggested changes. Changed a reference from executorInfo to executorData to be consistent with other usages
      6890663 [Dale] [SPARK-3651]  implemented suggested changes
      7d671cf [Dale] [SPARK-3651]  Grouped variables under a ExecutorDataObject, and reference them via a map entry as they are all retrieved under the same key
      9966d1a8
  2. Sep 27, 2014
    • Uri Laserson's avatar
      [SPARK-3389] Add Converter for ease of Parquet reading in PySpark · 24823293
      Uri Laserson authored
      https://issues.apache.org/jira/browse/SPARK-3389
      
      Author: Uri Laserson <laserson@cloudera.com>
      
      Closes #2256 from laserson/SPARK-3389 and squashes the following commits:
      
      0ed363e [Uri Laserson] PEP8'd the python file
      0b4b380 [Uri Laserson] Moved converter to examples and added python example
      eecf4dc [Uri Laserson] [SPARK-3389] Add Converter for ease of Parquet reading in PySpark
      24823293
    • Reynold Xin's avatar
      [SPARK-3543] Clean up Java TaskContext implementation. · 5b922bb4
      Reynold Xin authored
      This addresses some minor issues in https://github.com/apache/spark/pull/2425
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2557 from rxin/TaskContext and squashes the following commits:
      
      a51e5f6 [Reynold Xin] [SPARK-3543] Clean up Java TaskContext implementation.
      5b922bb4
    • Davies Liu's avatar
      [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD · 0d8cdf0e
      Davies Liu authored
      Currently, the schema of objects in ArrayType or MapType is attached lazily. This gives better performance but introduces issues during serialization and when accessing nested objects.
      
      This patch applies the schema to objects of ArrayType or MapType immediately when they are accessed. It is a little slower, but much more robust.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2526 from davies/nested and squashes the following commits:
      
      2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
      0d8cdf0e
    • Michael Armbrust's avatar
      [SPARK-3680][SQL] Fix bug caused by eager typing of HiveGenericUDFs · f0c7e195
      Michael Armbrust authored
      Typing of UDFs should be lazy as it is often not valid to call `dataType` on an expression until after all of its children are `resolved`.
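      The shape of the fix can be illustrated with a toy sketch (hypothetical classes, not Catalyst's actual ones): exposing the type as a lazy property defers the computation until the children are resolved, instead of failing eagerly in the constructor.

```python
# Hypothetical sketch, not Catalyst's implementation.
class Leaf:
    def __init__(self, resolved):
        self.resolved = resolved

class Concat:
    def __init__(self, children):
        self.children = children  # may still be unresolved here

    @property
    def data_type(self):
        # Deferred: only valid once all children are resolved.
        if not all(c.resolved for c in self.children):
            raise ValueError("dataType called on unresolved expression")
        return "string"

expr = Concat([Leaf(False)])   # eager typing would fail right here
expr.children = [Leaf(True)]   # ... after analysis resolves the children
print(expr.data_type)          # 'string'
```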
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2525 from marmbrus/concatBug and squashes the following commits:
      
      5b8efe7 [Michael Armbrust] fix bug with eager typing of udfs
      f0c7e195
    • w00228970's avatar
      [SPARK-3676][SQL] Fix hive test suite failure due to diffs in JDK 1.6/1.7 · 08008810
      w00228970 authored
      This is a bug in JDK6: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022
      
      This is caused by different JDK versions formatting ```double``` values differently:
      ```System.out.println(1/500d)``` prints different results depending on the JDK:
      jdk 1.6.0(_31) ---- 0.0020
      jdk 1.7.0(_05) ---- 0.002
      This makes HiveQuerySuite fail when the golden answers are generated under JDK 1.7 and the tests are run under JDK 1.6: the results do not match.
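      The general hazard, sketched in Python under the assumption that golden answers are compared as text: comparing floating-point output as strings couples a test to the formatter, while comparing parsed values does not.

```python
# Hypothetical helper, not part of Spark: treat "0.0020" and "0.002"
# as equal because they parse to the same double.
def numbers_match(expected: str, actual: str, tol: float = 1e-12) -> bool:
    return abs(float(expected) - float(actual)) <= tol

print(numbers_match("0.0020", "0.002"))  # True: same value, different rendering
print(numbers_match("0.002", "0.003"))   # False: genuinely different values
```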
      
      Author: w00228970 <wangfei1@huawei.com>
      
      Closes #2517 from scwf/HiveQuerySuite and squashes the following commits:
      
      0cb5e8d [w00228970] delete golden answer of division-0 and timestamp cast #1
      1df3964 [w00228970] Jdk version leads to different query output for Double, this make HiveQuerySuite failed
      08008810
    • CrazyJvm's avatar
      Docs : use "--total-executor-cores" rather than "--cores" after spark-shell · 66107f46
      CrazyJvm authored
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #2540 from CrazyJvm/standalone-core and squashes the following commits:
      
      66d9fc6 [CrazyJvm] use "--total-executor-cores" rather than "--cores" after spark-shell
      66107f46
    • Reynold Xin's avatar
      Minor cleanup to tighten visibility and remove compilation warning. · 436a7730
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2555 from rxin/cleanup and squashes the following commits:
      
      6add199 [Reynold Xin] Minor cleanup to tighten visibility and remove compilation warning.
      436a7730
    • Erik Erlandson's avatar
      [SPARK-1021] Defer the data-driven computation of partition bounds in so... · 2d972fd8
      Erik Erlandson authored
      ...rtByKey() until evaluation.
      
      Author: Erik Erlandson <eerlands@redhat.com>
      
      Closes #1689 from erikerlandson/spark-1021-pr and squashes the following commits:
      
      50b6da6 [Erik Erlandson] use standard getIteratorSize in countAsync
      4e334a9 [Erik Erlandson] exception mystery fixed by fixing bug in ComplexFutureAction
      b88b5d4 [Erik Erlandson] tweak async actions to use ComplexFutureAction[T] so they handle RangePartitioner sampling job properly
      b2b20e8 [Erik Erlandson] Fix bug in exception passing with ComplexFutureAction[T]
      ca8913e [Erik Erlandson] RangePartition sampling job -> FutureAction
      7143f97 [Erik Erlandson] [SPARK-1021] modify range bounds variable to be thread safe
      ac67195 [Erik Erlandson] [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.
      2d972fd8
    • Jeff Steinmetz's avatar
      stop, start and destroy require the EC2_REGION · 9e8ced78
      Jeff Steinmetz authored
      i.e.
      ./spark-ec2 --region=us-west-1 stop yourclustername
      
      Author: Jeff Steinmetz <jeffrey.steinmetz@gmail.com>
      
      Closes #2473 from jeffsteinmetz/master and squashes the following commits:
      
      7491f2c [Jeff Steinmetz] fix case in EC2 cluster setup documentation
      bd3d777 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      2bf4a57 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      68d8372 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      d2ab6e2 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      520e6dc [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args
      37fc876 [Jeff Steinmetz] stop, start and destroy require the EC2_REGION
      9e8ced78
    • Michael Armbrust's avatar
      [SPARK-3675][SQL] Allow starting a JDBC server on an existing context · d8a9d1d4
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2515 from marmbrus/jdbcExistingContext and squashes the following commits:
      
      7866fad [Michael Armbrust] Allows starting a JDBC server on an existing context.
      d8a9d1d4
    • Michael Armbrust's avatar
      [SQL][DOCS] Clarify that the server is for JDBC and ODBC · f0eea76d
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2527 from marmbrus/patch-1 and squashes the following commits:
      
      a0f9f1c [Michael Armbrust] [SQL][DOCS] Clarify that the server is for JDBC and ODBC
      f0eea76d
    • wangfei's avatar
      [Build]remove spark-staging-1030 · 0cdcdd2c
      wangfei authored
      Since 1.1.0 has been published, remove spark-staging-1030.
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #2532 from scwf/patch-2 and squashes the following commits:
      
      bc9e00b [wangfei] remove spark-staging-1030
      0cdcdd2c
    • Sarah Gerweck's avatar
      Slaves file is now a template. · e976ca23
      Sarah Gerweck authored
      Change 0dc868e7 removed the `conf/slaves` file and made it a template like most of the other configuration files. This means you can no longer run `make-distribution.sh` unless you manually create a slaves file to be statically bundled in your distribution, which seems at odds with making it a template file.
      
      Author: Sarah Gerweck <sarah.a180@gmail.com>
      
      Closes #2549 from sarahgerweck/noMoreSlaves and squashes the following commits:
      
      d11d99a [Sarah Gerweck] Slaves file is now a template.
      e976ca23
  3. Sep 26, 2014
    • Reynold Xin's avatar
      Close #2194. · a3feaf04
      Reynold Xin authored
      a3feaf04
    • Prashant Sharma's avatar
      [SPARK-3543] Write TaskContext in Java and expose it through a static accessor. · 5e34855c
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Shashank Sharma <shashank21j@gmail.com>
      
      Closes #2425 from ScrapCodes/SPARK-3543/withTaskContext and squashes the following commits:
      
      8ae414c [Shashank Sharma] CR
      ee8bd00 [Prashant Sharma] Added internal API in docs comments.
      ddb8cbe [Prashant Sharma] Moved setting the thread local to where TaskContext is instantiated.
      a7d5e23 [Prashant Sharma] Added doc comments.
      edf945e [Prashant Sharma] Code review git add -A
      f716fd1 [Prashant Sharma] introduced thread local for getting the task context.
      333c7d6 [Prashant Sharma] Translated Task context from scala to java.
      5e34855c
    • Josh Rosen's avatar
      Revert "[SPARK-3478] [PySpark] Profile the Python tasks" · f872e4fb
      Josh Rosen authored
      This reverts commit 1aa549ba.
      f872e4fb
    • Cheng Hao's avatar
      [SPARK-3393] [SQL] Align the log4j configuration for Spark & SparkSQLCLI · 7364fa5a
      Cheng Hao authored
      Users may be confused by the HQL logging and configuration; we'd better provide default templates.
      
      Both files are copied from Hive.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2263 from chenghao-intel/hive_template and squashes the following commits:
      
      53bffa9 [Cheng Hao] Remove the hive-log4j.properties initialization
      7364fa5a
    • Daoyuan Wang's avatar
      [SPARK-3531][SQL]select null from table would throw a MatchError · 0ec2d2e8
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2396 from adrian-wang/selectnull and squashes the following commits:
      
      2458229 [Daoyuan Wang] rebase solution
      0ec2d2e8
    • Andrew Or's avatar
      [SPARK-3476] Remove outdated memory checks in Yarn · 8da10bf1
      Andrew Or authored
      See description in [JIRA](https://issues.apache.org/jira/browse/SPARK-3476).
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2528 from andrewor14/yarn-memory-checks and squashes the following commits:
      
      c5400cd [Andrew Or] Simplify checks
      e30ffac [Andrew Or] Remove outdated memory checks
      8da10bf1
    • Daoyuan Wang's avatar
      [SPARK-3695]shuffle fetch fail output · 30461c6a
      Daoyuan Wang authored
      It should output the detailed host and port in the error message.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2539 from adrian-wang/fetchfail and squashes the following commits:
      
      6c1b1e0 [Daoyuan Wang] shuffle fetch fail output
      30461c6a
    • RJ Nowling's avatar
      [SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF · ec9df6a7
      RJ Nowling authored
      This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.
      
      This is implemented using a minimumOccurence parameter (default 0).  When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0.  As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
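      A minimal sketch of the described behavior (not MLlib's implementation; names here are illustrative): terms whose document frequency falls below the threshold get IDF 0, exactly as if they appeared in no document.

```python
import math

# Hypothetical standalone IDF with the min-document-frequency filter,
# using the smoothed formula log((n + 1) / (df + 1)).
def inverse_document_frequencies(docs, min_doc_freq=0):
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {
        term: (math.log((n + 1) / (count + 1)) if count >= min_doc_freq else 0.0)
        for term, count in df.items()
    }

docs = [["spark", "rdd"], ["spark", "sql"], ["hive"]]
idf = inverse_document_frequencies(docs, min_doc_freq=2)
print(idf["hive"])  # 0.0 -- df=1 is below the threshold of 2
```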
      
      This PR makes the following changes:
      * Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
      * Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to retain backwards compatibility with the original IDF API.
      * Set the IDFs to 0 for terms whose DFs are less than minimumOccurence
      * Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
      * Updated the MLLib Feature Extraction programming guide to describe the new feature
      
      Author: RJ Nowling <rnowling@gmail.com>
      
      Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:
      
      0aa3c63 [RJ Nowling] Fix identation
      e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
      bfa82ec [RJ Nowling] Add space after if
      30d20b3 [RJ Nowling] Add spaces around equals signs
      9013447 [RJ Nowling] Add space before division operator
      79978fc [RJ Nowling] Remove unnecessary semi-colon
      40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
      47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
      9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
      1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
      1801fd2 [RJ Nowling] Fix style errors in IDF.scala
      6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
      a200bab [RJ Nowling] Remove unnecessary else statement
      4b974f5 [RJ Nowling] Remove accidentally-added import from testing
      c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
      ec9df6a7
    • aniketbhatnagar's avatar
      SPARK-3639 | Removed settings master in examples · d16e161d
      aniketbhatnagar authored
      This patch removes the hard-coded local master in the Kinesis examples so that users can set it when submitting the job.
      
      Author: aniketbhatnagar <aniket.bhatnagar@gmail.com>
      
      Closes #2536 from aniketbhatnagar/Kinesis-Examples-Master-Unset and squashes the following commits:
      
      c9723ac [aniketbhatnagar] Merge remote-tracking branch 'origin/Kinesis-Examples-Master-Unset' into Kinesis-Examples-Master-Unset
      fec8ead [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
      31cdc59 [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
      d16e161d
    • Davies Liu's avatar
      [SPARK-3478] [PySpark] Profile the Python tasks · 1aa549ba
      Davies Liu authored
      This patch adds profiling support for PySpark. It shows the profiling results
      before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually with `sc.show_profiles()` or dump them to disk
      with `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default; it can be enabled by setting "spark.python.profile=true".
      
      Users can also have the results dumped to disk automatically for future analysis by setting "spark.python.profile.dump=path_to_dump"
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2351 from davies/profiler and squashes the following commits:
      
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      1aa549ba
    • Hari Shreedharan's avatar
      [SPARK-3686][STREAMING] Wait for sink to commit the channel before check... · b235e013
      Hari Shreedharan authored
      ...ing for the channel size.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #2531 from harishreedharan/sparksinksuite-fix and squashes the following commits:
      
      30393c1 [Hari Shreedharan] Use more deterministic method to figure out when batches come in.
      6ce9d8b [Hari Shreedharan] [SPARK-3686][STREAMING] Wait for sink to commit the channel before checking for the channel size.
      b235e013
  4. Sep 25, 2014
    • zsxwing's avatar
      SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap · 86bce764
      zsxwing authored
      MapOutputTrackerWorker.mapStatuses is used concurrently, so it should be thread-safe. This bug has already been fixed in #1328. Nevertheless, since #1328 won't be merged soon, I am sending this trivial fix in the hope that the issue can be solved soon.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #1541 from zsxwing/SPARK-2634 and squashes the following commits:
      
      d450053 [zsxwing] SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap
      86bce764
    • Kousuke Saruta's avatar
      [SPARK-3584] sbin/slaves doesn't work when we use password authentication for SSH · 0dc868e7
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2444 from sarutak/slaves-scripts-modification and squashes the following commits:
      
      eff7394 [Kousuke Saruta] Improve the description about Cluster Launch Script in docs/spark-standalone.md
      7858225 [Kousuke Saruta] Modified sbin/slaves to use the environment variable "SPARK_SSH_FOREGROUND" as a flag
      53d7121 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
      e570431 [Kousuke Saruta] Added a description for SPARK_SSH_FOREGROUND variable
      7120a0c [Kousuke Saruta] Added a description about default host for sbin/slaves
      1bba8a9 [Kousuke Saruta] Added SPARK_SSH_FOREGROUND flag to sbin/slaves
      88e2f17 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
      297e75d [Kousuke Saruta] Modified sbin/slaves not to export HOSTLIST
      0dc868e7
    • Aaron Staple's avatar
      [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data. · ff637c93
      Aaron Staple authored
      Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm’s current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix’s computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally so do not require a warning.
      
      I added a warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, if the warning were in GradientDescent or another optimizer, it would be printed for every GeneralizedLinearAlgorithm run, regardless of whether its input is cached.) I assume that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercepts and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent will be knowledgeable enough to cache their data without needing a log warning, so the lack of a warning in the optimizers may be acceptable.
      
      Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.
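      The warning pattern described above can be sketched as follows (hypothetical names, not MLlib's code): before iterating, inspect the input's storage level and warn when it is uncached, since every pass would then recompute the input.

```python
import logging
from collections import namedtuple

# Stand-in for PySpark's StorageLevel flags, for a self-contained sketch.
StorageLevel = namedtuple("StorageLevel", ["useMemory", "useDisk"])

def warn_if_uncached(storage_level, algorithm="KMeans"):
    """Return True (and log a warning) when the input is not cached."""
    if not (storage_level.useMemory or storage_level.useDisk):
        logging.warning(
            "Input to %s is not cached; each iteration recomputes it.",
            algorithm)
        return True
    return False

print(warn_if_uncached(StorageLevel(False, False)))  # True: uncached, warns
print(warn_if_uncached(StorageLevel(True, False)))   # False: cached, silent
```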
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #2347 from staple/SPARK-1484 and squashes the following commits:
      
      bd49701 [Aaron Staple] Address review comments.
      ab2d4a4 [Aaron Staple] Disable warnings on python code path.
      a7a0f99 [Aaron Staple] Change code comments per review comments.
      7cca1dc [Aaron Staple] Change warning message text.
      c77e939 [Aaron Staple] [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
      3b6c511 [Aaron Staple] Minor doc example fixes.
      ff637c93
    • epahomov's avatar
      [SPARK-3690] Closing shuffle writers we swallow more important exception · 9b56e249
      epahomov authored
      Author: epahomov <pahomov.egor@gmail.com>
      
      Closes #2537 from epahomov/SPARK-3690 and squashes the following commits:
      
      a0b7de4 [epahomov] [SPARK-3690] Closing shuffle writers we swallow more important exception
      9b56e249
    • Sean Owen's avatar
      SPARK-2932 [STREAMING] Move MasterFailureTest out of "main" source directory · c3f2a858
      Sean Owen authored
      (HT @vanzin) Whatever the reason was for having this test class in `main`, if there was one, it appears to be moot. This may have been a result of earlier streaming test reorganization.
      
      This simply puts `MasterFailureTest` back under `test/`, removes some redundant copied code, and touches up a few tiny inspection warnings along the way.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2399 from srowen/SPARK-2932 and squashes the following commits:
      
      3909411 [Sean Owen] Move MasterFailureTest to src/test, and remove redundant TestOutputStream
      c3f2a858
    • Marcelo Vanzin's avatar
      [SPARK-2778] [yarn] Add yarn integration tests. · b8487713
      Marcelo Vanzin authored
      This patch adds a couple of, currently, very simple integration tests
      to make sure both client and cluster modes are working. The tests don't
      do much yet other than run a simple job, but the plan is to enhance
      them after we get the framework in.
      
      The cluster tests are noisy, so redirect all log output to a file
      like other tests do. Copying the conf around sucks but it's less
      work than messing with maven/sbt and having to clean up other
      projects.
      
      Note the test is only added for yarn-stable. The code compiles
      against yarn-alpha but there are two issues I ran into that I
      could not overcome:
      - an old netty dependency kept creeping into the classpath and
        causing akka to not work, when using sbt; the old netty was
        correctly suppressed under maven.
      - MiniYARNCluster kept failing to execute containers because it
        did not create the NM's local dir itself; this is apparently
        a known behavior, but I'm not sure how to work around it.
      
      None of those issues are present with the stable Yarn.
      
      Also, these tests are a little slow to run. Apparently Spark doesn't
      yet tag tests (so that these could be isolated in a "slow" batch),
      so this is something to keep in mind.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2257 from vanzin/yarn-tests and squashes the following commits:
      
      6d5b84e [Marcelo Vanzin] Fix wrong system property being set.
      8b0933d [Marcelo Vanzin] Merge branch 'master' into yarn-tests
      5c2b56f [Marcelo Vanzin] Use custom log4j conf for Yarn containers.
      ec73f17 [Marcelo Vanzin] More review feedback.
      67f5b02 [Marcelo Vanzin] Review feedback.
      f01517c [Marcelo Vanzin] Review feedback.
      68fbbbf [Marcelo Vanzin] Use older constructor available in older Hadoop releases.
      d07ef9a [Marcelo Vanzin] Merge branch 'master' into yarn-tests
      add8416 [Marcelo Vanzin] [SPARK-2778] [yarn] Add yarn integration tests.
      b8487713
  5. Sep 24, 2014
    • Aaron Staple's avatar
      [SPARK-546] Add full outer join to RDD and DStream. · 8ca4ecb6
      Aaron Staple authored
      leftOuterJoin and rightOuterJoin are already implemented.  This patch adds fullOuterJoin.
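      The semantics being added can be sketched on plain pair collections (Python, not the RDD API): every key from either side appears once per matching pair, with None filling in the missing side.

```python
# Illustrative sketch of fullOuterJoin semantics, not Spark's implementation.
def full_outer_join(left, right):
    left_keys = {k for k, _ in left}
    right_keys = {k for k, _ in right}
    out = []
    for k in sorted(left_keys | right_keys):
        lvals = [v for kk, v in left if kk == k] or [None]
        rvals = [v for kk, v in right if kk == k] or [None]
        out.extend((k, (lv, rv)) for lv in lvals for rv in rvals)
    return out

print(full_outer_join([("a", 1), ("b", 2)], [("b", 20), ("c", 30)]))
# [('a', (1, None)), ('b', (2, 20)), ('c', (None, 30))]
```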
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #1395 from staple/SPARK-546 and squashes the following commits:
      
      1f5595c [Aaron Staple] Fix python style
      7ac0aa9 [Aaron Staple] [SPARK-546] Add full outer join to RDD and DStream.
      3b5d137 [Aaron Staple] In JavaPairDStream, make class tag specification in rightOuterJoin consistent with other functions.
      31f2956 [Aaron Staple] Fix left outer join documentation comments.
      8ca4ecb6
    • jerryshao's avatar
      [SPARK-3615][Streaming]Fix Kafka unit test hard coded Zookeeper port issue · 74fb2ecf
      jerryshao authored
      Details can be seen in [SPARK-3615](https://issues.apache.org/jira/browse/SPARK-3615).
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #2483 from jerryshao/SPARK_3615 and squashes the following commits:
      
      8555563 [jerryshao] Fix Kafka unit test hard coded Zookeeper port issue
      74fb2ecf
    • Davies Liu's avatar
      [SPARK-3679] [PySpark] pickle the exact globals of functions · bb96012b
      Davies Liu authored
      function.func_code.co_names has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals if a global shares its name with an attribute (since both appear in co_names).
      
      There is a regression introduced by #2144; this reverts part of the changes in that PR.
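      The pitfall is easy to demonstrate (`__code__` is the Python 3 spelling of `func_code`): `co_names` lists attribute names alongside global names, so a globals filter based on `co_names` alone would also pick up a global that merely shares a name with an attribute.

```python
count = 42  # global that happens to share a name with an attribute

def f(obj):
    return obj.count  # attribute access, not a use of the global

print("count" in f.__code__.co_names)  # True: co_names lists attributes too
```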
      
      cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2522 from davies/globals and squashes the following commits:
      
      dfbccf5 [Davies Liu] fix bug while pickle globals of function
      bb96012b
    • Davies Liu's avatar
      [SPARK-3634] [PySpark] User's module should take precedence over system modules · c854b9fc
      Davies Liu authored
      Python modules added through addPyFile should take precedence over system modules.
      
      This patch puts the path for user-added modules at the front of sys.path (just after '').
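      The ordering can be sketched directly (the path below is illustrative only): inserting at index 1 places the user's module directory just after the '' entry and ahead of system site-packages, so it wins name collisions.

```python
import sys

user_module_dir = "/tmp/user_pyfiles"  # hypothetical directory from addPyFile
sys.path.insert(1, user_module_dir)

print(sys.path[1])  # /tmp/user_pyfiles -- now precedes system entries
```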
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2492 from davies/path and squashes the following commits:
      
      4a2af78 [Davies Liu] fix tests
      f7ff4da [Davies Liu] ad license header
      6b0002f [Davies Liu] add tests
      c16c392 [Davies Liu] put addPyFile in front of sys.path
      c854b9fc
    • Shivaram Venkataraman's avatar
      [SPARK-3659] Set EC2 version to 1.1.0 and update version map · 50f86336
      Shivaram Venkataraman authored
      This brings the master branch in sync with branch-1.1
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #2510 from shivaram/spark-ec2-version and squashes the following commits:
      
      bb0dd16 [Shivaram Venkataraman] Set EC2 version to 1.1.0 and update version map
      50f86336
    • Nicholas Chammas's avatar
      [Build] Diff from branch point · c4291260
      Nicholas Chammas authored
      Sometimes Jenkins posts [spurious reports of new classes being added](https://github.com/apache/spark/pull/2339#issuecomment-56570170). I believe this stems from diffing the patch against `master`, as opposed to against `master...`, which starts from the commit the PR was branched from.
      
      This patch fixes that behavior.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2512 from nchammas/diff-only-commits-ahead and squashes the following commits:
      
      c065599 [Nicholas Chammas] comment typo fix
      a453c67 [Nicholas Chammas] diff from branch point
      c4291260
  6. Sep 23, 2014
    • Mubarak Seyed's avatar
      [SPARK-1853] Show Streaming application code context (file, line number) in Spark Stages UI · 729952a5
      Mubarak Seyed authored
      This is a refactored version of the original PR https://github.com/apache/spark/pull/1723 by mubarak
      
      Please take a look andrewor14, mubarak
      
      Author: Mubarak Seyed <mubarak.seyed@gmail.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2464 from tdas/streaming-callsite and squashes the following commits:
      
      dc54c71 [Tathagata Das] Made changes based on PR comments.
      390b45d [Tathagata Das] Fixed minor bugs.
      904cd92 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-callsite
      7baa427 [Tathagata Das] Refactored getCallSite and setCallSite to make it simpler. Also added unit test for DStream creation site.
      b9ed945 [Mubarak Seyed] Adding streaming utils
      c461cf4 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
      ceb43da [Mubarak Seyed] Changing default regex function name
      8c5d443 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
      196121b [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
      491a1eb [Mubarak Seyed] Removing streaming visibility from getRDDCreationCallSite in DStream
      33a7295 [Mubarak Seyed] Fixing review comments: Merging both setCallSite methods
      c26d933 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
      f51fd9f [Mubarak Seyed] Fixing scalastyle, Regex for Utils.getCallSite, and changing method names in DStream
      5051c58 [Mubarak Seyed] Getting return value of compute() into variable and call setCallSite(prevCallSite) only once. Adding return for other code paths (for None)
      a207eb7 [Mubarak Seyed] Fixing code review comments
      ccde038 [Mubarak Seyed] Removing Utils import from MappedDStream
      2a09ad6 [Mubarak Seyed] Changes in Utils.scala for SPARK-1853
      1d90cc3 [Mubarak Seyed] Changes for SPARK-1853
      5f3105a [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
      70f494f [Mubarak Seyed] Changes for SPARK-1853
      1500deb [Mubarak Seyed] Changes in Spark Streaming UI
      9d38d3c [Mubarak Seyed] [SPARK-1853] Show Streaming application code context (file, line number) in Spark Stages UI
      d466d75 [Mubarak Seyed] Changes for spark streaming UI
      729952a5
    • Andrew Or's avatar
      [SPARK-3653] Respect SPARK_*_MEMORY for cluster mode · b3fef50e
      Andrew Or authored
      `SPARK_DRIVER_MEMORY` was only used to start the `SparkSubmit` JVM, which becomes the driver only in client mode but not cluster mode. In cluster mode, this property is simply not propagated to the worker nodes.
      
      `SPARK_EXECUTOR_MEMORY` is picked up from `SparkContext`, but in cluster mode the driver runs on one of the worker machines, where this environment variable may not be set.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2500 from andrewor14/memory-env-vars and squashes the following commits:
      
      6217b38 [Andrew Or] Respect SPARK_*_MEMORY for cluster mode
      b3fef50e