  1. Sep 15, 2014
    • [SPARK-3433][BUILD] Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations. · ecf0c029
      Prashant Sharma authored
      The reported false positives were actually due to the MiMa generator not picking up the new jars in the presence of old jars (theoretically this should not have happened). As a workaround, we run them both separately and just append the results together.
      
      Author: Prashant Sharma <prashant@apache.org>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2285 from ScrapCodes/mima-fix and squashes the following commits:
      
      093c76f [Prashant Sharma] Update mima
      59012a8 [Prashant Sharma] Update mima
      35b6c71 [Prashant Sharma] SPARK-3433 Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations.
      ecf0c029
    • [SPARK-3540] Add reboot-slaves functionality to the ec2 script · d428ac6a
      Reynold Xin authored
      Tested on a real cluster.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2404 from rxin/ec2-reboot-slaves and squashes the following commits:
      
      00a2dbd [Reynold Xin] Allow rebooting slaves.
      d428ac6a
    • [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. · 60050f42
      Aaron Staple authored
      Also made some cosmetic cleanups.
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #2385 from staple/SPARK-1087 and squashes the following commits:
      
      7b3bb13 [Aaron Staple] Address review comments, cosmetic cleanups.
      10ba6e1 [Aaron Staple] [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
      60050f42
    • [SPARK-2951] [PySpark] support unpickle array.array for Python 2.6 · da33acb8
      Davies Liu authored
      Pyrolite cannot unpickle an array.array that was pickled by Python 2.6; this patch fixes that by extending Pyrolite.
      
      There is also a bug in Pyrolite when unpickling arrays of float/double; this patch works around it by reversing the endianness for float/double. The workaround should be removed once Pyrolite ships a new release that fixes the issue.
      
      I have sent a PR to Pyrolite to fix it: https://github.com/irmen/Pyrolite/pull/11
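      
      For context, a minimal sketch (hypothetical; assumes a live SparkContext `sc` under Python 2.6) of array.array values flowing through PySpark's pickle serializer:
      
      ```python
      from array import array
      
      # array.array values are pickled when they move between the driver and workers
      rdd = sc.parallelize([array('f', [1.0, 2.0]), array('d', [3.0, 4.0])])
      print rdd.map(lambda a: a[0]).collect()
      ```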
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2365 from davies/pickle and squashes the following commits:
      
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      da33acb8
    • [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API · fdb302f4
      qiping.lqp authored
      Added minInstancesPerNode, minInfoGain params to:
      * DecisionTreeRunner.scala example
      * Python API (tree.py), as sketched below
      
      Also:
      * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
      * small style fixes
      
      CC: mengxr
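      
      A hedged sketch of the new Python parameters (assumes `data` is an RDD of LabeledPoint; the remaining keyword defaults come from tree.py):
      
      ```python
      from pyspark.mllib.tree import DecisionTree
      
      model = DecisionTree.trainClassifier(
          data, numClasses=2, categoricalFeaturesInfo={},
          minInstancesPerNode=2,  # each child of a split must get at least 2 instances
          minInfoGain=0.0)        # minimum information gain required for a split
      ```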
      
      Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      Author: chouqin <liqiping1991@gmail.com>
      
      Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:
      
      61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
      a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
      e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
      f1d11d1 [chouqin] fix typo
      c7ebaf1 [chouqin] fix typo
      39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
      c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
      0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
      d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
      efcc736 [qiping.lqp] fix bug
      10b8012 [qiping.lqp] fix style
      6728fad [qiping.lqp] minor fix: remove empty lines
      bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
      cadd569 [qiping.lqp] add api docs
      46b891f [qiping.lqp] fix bug
      e72c7e4 [qiping.lqp] add comments
      845c6fa [qiping.lqp] fix style
      f195e83 [qiping.lqp] fix style
      987cbf4 [qiping.lqp] fix bug
      ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
      ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
      fdb302f4
    • [MLlib] Update SVD documentation in IndexedRowMatrix · 983d6a9c
      Reza Zadeh authored
      Updating this to reflect the newest SVD via ARPACK
      
      Author: Reza Zadeh <rizlar@gmail.com>
      
      Closes #2389 from rezazadeh/irmdocs and squashes the following commits:
      
      7fa1313 [Reza Zadeh] Update svd docs
      715da25 [Reza Zadeh] Updated computeSVD documentation IndexedRowMatrix
      983d6a9c
    • [SPARK-3396][MLLIB] Use SquaredL2Updater in LogisticRegressionWithSGD · 3b931281
      Christoph Sawade authored
      SimpleUpdater ignores the regularizer, which leads to an unregularized LogReg. To enable the common L2 regularizer (and the corresponding regularization parameter) for logistic regression, the SquaredL2Updater has to be used in SGD (see, e.g., [SVMWithSGD]).
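      
      A hedged PySpark sketch (assumes `points` is an RDD of LabeledPoint; keyword names taken from the MLlib Python API of this era), showing the regularization parameter that now takes effect:
      
      ```python
      from pyspark.mllib.classification import LogisticRegressionWithSGD
      
      # regParam is now honored, since SGD applies the SquaredL2Updater
      model = LogisticRegressionWithSGD.train(points, iterations=100, regParam=0.1)
      ```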
      
      Author: Christoph Sawade <christoph@sawade.me>
      
      Closes #2398 from BigCrunsh/fix-regparam-logreg and squashes the following commits:
      
      0820c04 [Christoph Sawade] Use SquaredL2Updater in LogisticRegressionWithSGD
      3b931281
    • [SPARK-2714] DAGScheduler logs jobid when runJob finishes · 37d92528
      yantangzhai authored
      Author: yantangzhai <tyz0303@163.com>
      
      Closes #1617 from YanTangZhai/SPARK-2714 and squashes the following commits:
      
      0a0243f [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      fbb1150 [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      7aec2a9 [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      fb42f0f [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      090d908 [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      37d92528
    • [SPARK-3518] Remove wasted statement in JsonProtocol · e59fac1f
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2380 from sarutak/SPARK-3518 and squashes the following commits:
      
      8a1464e [Kousuke Saruta] Replaced a variable with simple field reference
      c660fbc [Kousuke Saruta] Removed useless statement in JsonProtocol.scala
      e59fac1f
    • [SPARK-3425] do not set MaxPermSize for OpenJDK 1.8 · fe2b1d6a
      Matthew Farrellee authored
      Closes #2387
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2301 from mattf/SPARK-3425 and squashes the following commits:
      
      20f3c09 [Matthew Farrellee] [SPARK-3425] do not set MaxPermSize for OpenJDK 1.8
      fe2b1d6a
    • [SPARK-3410] The priority of shutdownhook for ApplicationMaster should not be integer literal · cc146444
      Kousuke Saruta authored
      The shutdown hook for ApplicationMaster should keep a higher priority than the shutdown hook for o.a.h.FileSystem, even if the priority of the FileSystem hook changes, so it should be defined relative to FileSystem's priority rather than as an integer literal.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2283 from sarutak/SPARK-3410 and squashes the following commits:
      
      1d44fef [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3410
      bd6cc53 [Kousuke Saruta] Modified style
      ee6f1aa [Kousuke Saruta] Added constant "SHUTDOWN_HOOK_PRIORITY" to ApplicationMaster
      54eb68f [Kousuke Saruta] Changed Shutdown hook priority to 20
      2f0aee3 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3410
      4c5cb93 [Kousuke Saruta] Modified the priority for AM's shutdown hook
      217d1a4 [Kousuke Saruta] Removed unused import statements
      717aba2 [Kousuke Saruta] Modified ApplicationMaster to make to keep the priority of shutdown hook for ApplicationMaster higher than the priority of shutdown hook for HDFS
      cc146444
  2. Sep 14, 2014
    • [SPARK-3452] Maven build should skip publishing artifacts people shouldn't depend on · f493f798
      Prashant Sharma authored
      In Maven terms, publishing locally is `install` and publishing otherwise is `deploy`; both are disabled for the following projects.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2329 from ScrapCodes/SPARK-3452/maven-skip-install and squashes the following commits:
      
      257b79a [Prashant Sharma] [SPARK-3452] Maven build should skip publishing artifacts people shouldn't depend on
      f493f798
    • SPARK-3039: Allow spark to be built using avro-mapred for hadoop2 · c243b21a
      Bertrand Bossy authored
      SPARK-3039: Adds the Maven property "avro.mapred.classifier" to build spark-assembly with avro-mapred with support for the new Hadoop API. Sets this property to hadoop2 for Hadoop 2 profiles.
      
      I am not very familiar with Maven, nor do I know whether this potentially breaks something in the Hive part of Spark. There might be a more elegant way of doing this.
      
      Author: Bertrand Bossy <bertrandbossy@gmail.com>
      
      Closes #1945 from bbossy/SPARK-3039 and squashes the following commits:
      
      c32ce59 [Bertrand Bossy] SPARK-3039: Allow spark to be built using avro-mapred for hadoop2
      c243b21a
    • [SPARK-3463] [PySpark] aggregate and show spilled bytes in Python · 4e3fbe8c
      Davies Liu authored
      Aggregates the number of bytes spilled to disk during aggregation or sorting, and shows it in the Web UI.
      
      ![spilled](https://cloud.githubusercontent.com/assets/40902/4209758/4b995562-386d-11e4-97c1-8e838ee1d4e3.png)
      
      This patch is blocked by SPARK-3465 (it includes a fix for that issue).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2336 from davies/metrics and squashes the following commits:
      
      e37df38 [Davies Liu] remove outdated comments
      1245eb7 [Davies Liu] remove the temporary fix
      ebd2f43 [Davies Liu] Merge branch 'master' into metrics
      7e4ad04 [Davies Liu] Merge branch 'master' into metrics
      fbe9029 [Davies Liu] show spilled bytes in Python in web ui
      4e3fbe8c
  3. Sep 13, 2014
    • [SPARK-3030] [PySpark] Reuse Python worker · 2aea0da8
      Davies Liu authored
      Reuse Python workers to avoid the overhead of forking a Python process for each task. Also track the broadcasts for each worker to avoid resending repeated broadcasts.
      
      This can reduce the time for a dummy task from 22ms to 13ms (-40%), which helps reduce latency for Spark Streaming.
      
      For a job with a broadcast (43M after compression):
      ```
          b = sc.broadcast(set(range(30000000)))
          print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
      ```
      It finishes in 281s without worker reuse and in 65s with a reused worker (4 CPUs). Reusing the worker saves about 9 seconds of broadcast transfer and deserialization per task.
      
      It is enabled by default, and can be disabled with `spark.python.worker.reuse = false`.
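      
      A minimal sketch of opting out (the config key is taken from this commit message):
      
      ```python
      from pyspark import SparkConf, SparkContext
      
      conf = SparkConf().set("spark.python.worker.reuse", "false")
      sc = SparkContext(conf=conf)  # each task now forks a fresh Python worker
      ```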
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2259 from davies/reuse-worker and squashes the following commits:
      
      f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
      3939f20 [Davies Liu] fix bug in serializer in mllib
      cf1c55e [Davies Liu] address comments
      3133a60 [Davies Liu] fix accumulator with reused worker
      760ab1f [Davies Liu] do not reuse worker if there are any exceptions
      7abb224 [Davies Liu] refactor: sychronized with itself
      ac3206e [Davies Liu] renaming
      8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
      6325fc1 [Davies Liu] bugfix: bid >= 0
      e0131a2 [Davies Liu] fix name of config
      583716e [Davies Liu] only reuse completed and not interrupted worker
      ace2917 [Davies Liu] kill python worker after timeout
      6123d0f [Davies Liu] track broadcasts for each worker
      8d2f08c [Davies Liu] reuse python worker
      2aea0da8
    • [SQL] Decrease partitions when testing · 0f8c4edf
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2164 from marmbrus/shufflePartitions and squashes the following commits:
      
      0da1e8c [Michael Armbrust] test hax
      ef2d985 [Michael Armbrust] more test hacks.
      2dabae3 [Michael Armbrust] more test fixes
      0bdbf21 [Michael Armbrust] Make parquet tests less order dependent
      b42eeab [Michael Armbrust] increase test parallelism
      80453d5 [Michael Armbrust] Decrease partitions when testing
      0f8c4edf
    • [SPARK-3294][SQL] Eliminates boxing costs from in-memory columnar storage · 74049249
      Cheng Lian authored
      This is a major refactoring of the in-memory columnar storage implementation, aiming to eliminate boxing costs from critical paths (building/accessing column buffers) as much as possible. The basic idea is to refactor all major interfaces into a row-based form and use them together with `SpecificMutableRow`. The difficult part is how to adapt all compression schemes, especially `RunLengthEncoding` and `DictionaryEncoding`, to this design. Since in-memory compression is disabled by default for now, and this PR should be strictly better than before whether or not in-memory compression is enabled, maybe I'll finish that part in another PR.
      
      **UPDATE** This PR also took the chance to optimize `HiveTableScan` by
      
      1. leveraging `SpecificMutableRow` to avoid boxing cost, and
      2. building specific `Writable` unwrapper functions ahead of time to avoid per-row pattern matching and branching costs.
      
      TODO
      
      - [x] Benchmark
      - [ ] ~~Eliminate boxing costs in `RunLengthEncoding`~~ (left to future PRs)
      - [ ] ~~Eliminate boxing costs in `DictionaryEncoding` (seems not easy to do without specializing `DictionaryEncoding` for every supported column type)~~  (left to future PRs)
      
      ## Micro benchmark
      
      The benchmark uses a 10-million-line CSV table consisting of bytes, shorts, integers, longs, floats, and doubles. It measures the time to build the in-memory version of this table and the time to scan the whole in-memory table.
      
      Benchmark code can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-hivetablescanbenchmark-scala). Script used to generate the input table can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-tablegen-scala).
      
      Speedup:
      
      - Hive table scanning + column buffer building: **18.74%**
      
        The original benchmark uses 1K as the in-memory batch size; when increased to 10K, it can be 28.32% faster.
      
      - In-memory table scanning: **7.95%**
      
      Before:
      
      Run     | Building | Scanning
      ------- | -------- | --------
      1       | 16472    | 525
      2       | 16168    | 530
      3       | 16386    | 529
      4       | 16184    | 538
      5       | 16209    | 521
      Average | 16283.8  | 528.6
      
      After:
      
      Run     | Building | Scanning
      ------- | -------- | --------
      1       | 13124    | 458
      2       | 13260    | 529
      3       | 12981    | 463
      4       | 13214    | 483
      5       | 13583    | 500
      Average | 13232.4  | 486.6
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2327 from liancheng/prevent-boxing/unboxing and squashes the following commits:
      
      4419fe4 [Cheng Lian] Addressing comments
      e5d2cf2 [Cheng Lian] Bug fix: should call setNullAt when field value is null to avoid NPE
      8b8552b [Cheng Lian] Only checks for partition batch pruning flag once
      489f97b [Cheng Lian] Bug fix: TableReader.fillObject uses wrong ordinals
      97bbc4e [Cheng Lian] Optimizes hive.TableReader by by providing specific Writable unwrappers a head of time
      3dc1f94 [Cheng Lian] Minor changes to eliminate row object creation
      5b39cb9 [Cheng Lian] Lowers log level of compression scheme details
      f2a7890 [Cheng Lian] Use SpecificMutableRow in InMemoryColumnarTableScan to avoid boxing
      9cf30b0 [Cheng Lian] Added row based ColumnType.append/extract
      456c366 [Cheng Lian] Made compression decoder row based
      edac3cd [Cheng Lian] Makes ColumnAccessor.extractSingle row based
      8216936 [Cheng Lian] Removes boxing cost in IntDelta and LongDelta by providing specialized implementations
      b70d519 [Cheng Lian] Made some in-memory columnar storage interfaces row-based
      74049249
    • [SPARK-3481][SQL] Removes the evil MINOR HACK · 184cd51c
      Cheng Lian authored
      This is a follow up of #2352. Now we can finally remove the evil "MINOR HACK", which covered up the eldest bug in the history of Spark SQL (see details [here](https://github.com/apache/spark/pull/2352#issuecomment-55440621)).
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2377 from liancheng/remove-evil-minor-hack and squashes the following commits:
      
      0869c78 [Cheng Lian] Removes the evil MINOR HACK
      184cd51c
    • [SQL] [Docs] typo fixes · a523ceaf
      Nicholas Chammas authored
      * Fixed random typo
      * Added in missing description for DecimalType
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2367 from nchammas/patch-1 and squashes the following commits:
      
      aa528be [Nicholas Chammas] doc fix for SQL DecimalType
      3247ac1 [Nicholas Chammas] [SQL] [Docs] typo fixes
      a523ceaf
    • Proper indent for the previous commit. · b4dded40
      Reynold Xin authored
      b4dded40
    • SPARK-3470 [CORE] [STREAMING] Add Closeable / close() to Java context objects that expose a stop() lifecycle method · feaa3706
      Sean Owen authored
      This doesn't add `AutoCloseable`, which is Java 7+ only. But it should be possible to use try-with-resources on a `Closeable` in Java 7, as long as the `close()` does not throw a checked exception, and these don't. Q.E.D.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2346 from srowen/SPARK-3470 and squashes the following commits:
      
      612c21d [Sean Owen] Add Closeable / close() to Java context objects that expose a stop() lifecycle method
      feaa3706
  4. Sep 12, 2014
    • [SQL][Docs] Update SQL programming guide to show the correct default value of containsNull in an ArrayType · e11eeb71
      Yin Huai authored
      After #1889, the default value of `containsNull` in an `ArrayType` is `true`.
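      
      A hedged sketch of the documented default (class names assumed from the pyspark.sql module of this era):
      
      ```python
      from pyspark.sql import ArrayType, StringType
      
      tpe = ArrayType(StringType())  # containsNull defaults to True after #1889
      assert tpe.containsNull
      ```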
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #2374 from yhuai/containsNull and squashes the following commits:
      
      dc609a3 [Yin Huai] Update the SQL programming guide to show the correct default value of containsNull in an ArrayType (the default value is true instead of false).
      e11eeb71
    • [SPARK-3469] Make sure all TaskCompletionListener are called even with failures · 2584ea5b
      Reynold Xin authored
      This is necessary because we rely on this callback interface to clean resources up. The old behavior would lead to resource leaks.
      
      Note that this also changes the fault semantics of TaskCompletionListener. Previously, failures in TaskCompletionListeners would result in the task failure being reported immediately. With this change, we report the exception at the end, and the reported exception is a TaskCompletionListenerException that contains all the exception messages.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2343 from rxin/taskcontext-callback and squashes the following commits:
      
      a3845b2 [Reynold Xin] Mark TaskCompletionListenerException as private[spark].
      ac5baea [Reynold Xin] Removed obsolete comment.
      aa68ea4 [Reynold Xin] Throw an exception if task completion callback fails.
      29b6162 [Reynold Xin] oops compilation failed.
      1cb444d [Reynold Xin] [SPARK-3469] Call all TaskCompletionListeners even if some fail.
      2584ea5b
    • [SPARK-3515][SQL] Moves test suite setup code to beforeAll rather than in constructor · 6d887db7
      Cheng Lian authored
      Please refer to the JIRA ticket for details.
      
      **NOTE** We should check all test suites that perform similar initialization-like side effects in their constructors. This PR only fixes `ParquetMetastoreSuite` because it breaks our Jenkins Maven build.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2375 from liancheng/say-no-to-constructor and squashes the following commits:
      
      0ceb75b [Cheng Lian] Moves test suite setup code to beforeAll rather than in constructor
      6d887db7
    • [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd · 885d1621
      Davies Liu authored
      Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(), repartition()) cannot easily be called from Python; there is no way to specify the implicit parameter `ord`. Since _jrdd is a JavaRDD, _jschema_rdd should also be a JavaSchemaRDD.
      
      This patch changes _jschema_rdd to a JavaSchemaRDD and adds an assert for it. If a method is missing from JavaSchemaRDD, it is invoked via _jschema_rdd.baseSchemaRDD().xxx().
      
      BTW, do we need JavaSQLContext?
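      
      A hedged sketch of what this enables (hypothetical data path; `jsonFile` is from the Python SQL API):
      
      ```python
      srdd = sqlContext.jsonFile("people.json")  # a Python SchemaRDD
      # _jschema_rdd is now a JavaSchemaRDD, so JVM-side methods such as
      # coalesce() work without Scala's implicit `ord` parameter
      smaller = srdd.coalesce(2)
      # anything missing from JavaSchemaRDD can fall back through, e.g.,
      # srdd._jschema_rdd.baseSchemaRDD()
      ```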
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2369 from davies/fix_schemardd and squashes the following commits:
      
      abee159 [Davies Liu] use JavaSchemaRDD as SchemaRDD._jschema_rdd
      885d1621
    • [SPARK-3094] [PySpark] compatible with PyPy · 71af030b
      Davies Liu authored
      After this patch, we can run PySpark in PyPy (tested with PyPy 2.3.1 on Mac OS X 10.9), for example:
      
      ```
      PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
      ```
      
      The performance speedup depends on the workload (from 20% to 3000%). Here are some benchmarks:
      
       Job | CPython 2.7 | PyPy 2.3.1  | Speed up
       ------- | ------------ | ------------- | -------
       Word Count | 41s   | 15s  | 2.7x
       Sort | 46s |  44s | 1.05x
       Stats | 174s | 3.6s | 48x
      
      Here is the code used for benchmark:
      
      ```python
      rdd = sc.textFile("text")
      def wordcount():
          rdd.flatMap(lambda x:x.split('/'))\
              .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
      def sort():
          rdd.sortBy(lambda x:x, 1).count()
      def stats():
          sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
      ```
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2144 from davies/pypy and squashes the following commits:
      
      9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
      4bc1f04 [Davies Liu] refactor
      b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
      3ca2351 [Davies Liu] Merge branch 'master' into pypy
      fae8b19 [Davies Liu] improve attrgetter, add tests
      591f830 [Davies Liu] try to run tests with PyPy in run-tests
      c8d62ba [Davies Liu] cleanup
      f651fd0 [Davies Liu] fix tests using array with PyPy
      1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
      3c1dbfe [Davies Liu] Merge branch 'master' into pypy
      42fb5fa [Davies Liu] Merge branch 'master' into pypy
      cb2d724 [Davies Liu] fix tests
      9986692 [Davies Liu] Merge branch 'master' into pypy
      25b4ca7 [Davies Liu] support PyPy
      71af030b
    • [SPARK-3456] YarnAllocator on alpha can lose container requests to RM · 25311c2c
      Thomas Graves authored
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #2373 from tgravescs/SPARK-3456 and squashes the following commits:
      
      77e9532 [Thomas Graves] [SPARK-3456] YarnAllocator on alpha can lose container requests to RM
      25311c2c
    • [SPARK-3217] Add Guava to classpath when SPARK_PREPEND_CLASSES is set. · af258382
      Marcelo Vanzin authored
      When that option is used, the compiled classes from the build directory
      are prepended to the classpath. Now that we avoid packaging Guava, that
      means we have classes referencing the original Guava location in the app's
      classpath, so errors happen.
      
      For that case, add Guava manually to the classpath.
      
      Note: if Spark is compiled with "-Phadoop-provided", it's tricky to
      make things work with SPARK_PREPEND_CLASSES, because you need to add
      the Hadoop classpath using SPARK_CLASSPATH and that means the older
      Hadoop Guava overrides the newer one Spark needs. So someone using
      SPARK_PREPEND_CLASSES needs to remember to not use that profile.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2141 from vanzin/SPARK-3217 and squashes the following commits:
      
      b967324 [Marcelo Vanzin] [SPARK-3217] Add Guava to classpath when SPARK_PREPEND_CLASSES is set.
      af258382
    • SPARK-3014. Log more informative messages in a couple of failure scenarios · 1d767967
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1934 from sryza/sandy-spark-3014 and squashes the following commits:
      
      ae19cc1 [Sandy Ryza] SPARK-3014. Log a more informative messages in a couple failure scenarios
      1d767967
    • [SPARK-3427] [GraphX] Avoid active vertex tracking in static PageRank · 15a56459
      Ankur Dave authored
      GraphX's current implementation of static (fixed iteration count) PageRank uses the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs:
      
      1. A shuffle per iteration to ship the active sets to the edge partitions.
      2. A hash table creation per iteration at each partition to index the active sets for lookup.
      3. A hash lookup per edge to check whether the source vertex is active.
      
      I reimplemented static PageRank using the lower-level GraphX API instead of the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #2308 from ankurdave/SPARK-3427 and squashes the following commits:
      
      449996a [Ankur Dave] Avoid unnecessary active vertex tracking in static PageRank
      15a56459
    • MAINTENANCE: Automated closing of pull requests. · eae81b0b
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #930 (close requested by 'andrewor14')
      Closes #867 (close requested by 'marmbrus')
      Closes #1829 (close requested by 'marmbrus')
      Closes #1131 (close requested by 'JoshRosen')
      Closes #1571 (close requested by 'andrewor14')
      Closes #2359 (close requested by 'andrewor14')
      eae81b0b
    • [SPARK-3481] [SQL] Eliminate the error log in local Hive comparison test · 8194fc66
      Cheng Hao authored
      Logically, we should remove the Hive table/database first and then reset the Hive configuration, repointing to the new data warehouse directory, etc. Otherwise it raises exceptions like "Database doesn't not exists: default" in local testing.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2352 from chenghao-intel/test_hive and squashes the following commits:
      
      74fd76b [Cheng Hao] eliminate the error log
      8194fc66
    • [PySpark] Add blank line so that Python RDD.top() docstring renders correctly · 53337762
      RJ Nowling authored
      Author: RJ Nowling <rnowling@gmail.com>
      
      Closes #2370 from rnowling/python_rdd_docstrings and squashes the following commits:
      
      5230574 [RJ Nowling] Add blank line so that Python RDD.top() docstring renders correctly
      53337762
    • [SPARK-2558][DOCS] Add --queue example to YARN doc · f116f76b
      Mark G. Whitney authored
      Puts the original YARN queue spark-submit argument description in the running-on-yarn HTML table and the example command line.
      
      Author: Mark G. Whitney <mark@whitneyindustries.com>
      
      Closes #2218 from kramimus/2258-yarndoc and squashes the following commits:
      
      4b5d808 [Mark G. Whitney] remove yarn queue config
      f8cda0d [Mark G. Whitney] [SPARK-2558][DOCS] Add spark.yarn.queue description to YARN doc
      f116f76b
    • [SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays. Memory calc bug fix. · b8634df1
      Joseph K. Bradley authored
      This PR includes some code simplifications and re-organization which will be helpful for implementing random forests.  The main changes are that the nodes and parentImpurities arrays are no longer pre-allocated in the main train() method.
      
      Also added 2 bug fixes:
      * maxMemoryUsage calculation
      * over-allocation of space for bins in DTStatsAggregator for unordered features.
      
      Relation to RFs:
      * Since RFs will be deeper and will therefore be more likely sparse (not full trees), it could be a cost savings to avoid pre-allocating a full tree.
      * The associated re-organization also reduces bookkeeping, which will make RFs easier to implement.
      * The return code doneTraining may be generalized to include cases such as nodes ready for local training.
      
      Details:
      
      No longer pre-allocate parentImpurities array in main train() method.
      * parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
      * These were not really needed. They were used in calculateGainForSplit(), but they can be calculated anyway using parentNodeAgg.
      
      No longer using Node.build since tree structure is constructed on-the-fly.
      * Did not eliminate since it is public (Developer) API.  Marked as deprecated.
      
      Eliminated pre-allocated nodes array in main train() method.
      * Nodes are constructed and added to the tree structure as needed during training.
      * Moved tree construction from main train() method into findBestSplitsPerGroup() since there is no need to keep the (split, gain) array for an entire level of nodes. Only one element of that array is needed at a time, so we do not need the array.
      
      findBestSplits() now returns 2 items:
      * rootNode (newly created root node on first iteration, same root node on later iterations)
      * doneTraining (indicating if all nodes at that level were leafs)
      
      Updated DecisionTreeSuite.  Notes:
      * Improved test "Second level node building with vs. without groups"
      ** generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
      * Related update: Added Node.deepCopy (private[tree]), used for test suite
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2341 from jkbradley/dt-spark-3160 and squashes the following commits:
      
      07dd1ee [Joseph K. Bradley] Fixed overflow bug with computing maxMemoryUsage in DecisionTree.  Also fixed bug with over-allocating space in DTStatsAggregator for unordered features.
      debe072 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
      5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
      0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
      306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
      eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
      d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
      d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
      1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
      2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
      b8634df1
  5. Sep 11, 2014
    • [SPARK-3465] fix task metrics aggregation in local mode · 42904b8d
      Davies Liu authored
      Before overwriting t.taskMetrics, take a deep copy of it.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2338 from davies/fix_metric and squashes the following commits:
      
      a5cdb63 [Davies Liu] Merge branch 'master' into fix_metric
      7c879e0 [Davies Liu] add more comments
      754b5b8 [Davies Liu] copy taskMetrics only when isLocal is true
      5ca26dc [Davies Liu] fix task metrics aggregation in local mode
      42904b8d
    • SPARK-2482: Resolve sbt warnings during build · 33c7a738
      witgo authored
      Also, importing `scala.language.postfixOps` and `org.scalatest.time.SpanSugar._` at the same time causes `scala.language.postfixOps` not to work.
      
      Author: witgo <witgo@qq.com>
      
      Closes #1330 from witgo/sbt_warnings3 and squashes the following commits:
      
      179ba61 [witgo] Resolve sbt warnings during build
      33c7a738
    • SPARK-3462 push down filters and projections into Unions · f858f466
      Cody Koeninger authored
      Author: Cody Koeninger <cody.koeninger@mediacrossing.com>
      
      Closes #2345 from koeninger/SPARK-3462 and squashes the following commits:
      
      5c8d24d [Cody Koeninger] SPARK-3462 remove now-unused parameter
      0788691 [Cody Koeninger] SPARK-3462 add tests, handle compatible schema with different aliases, per marmbrus feedback
      ef47b3b [Cody Koeninger] SPARK-3462 push down filters and projections into Unions
      f858f466
    • [SPARK-3429] Don't include the empty string "" as a defaultAclUser · ce59725b
      Andrew Ash authored
      Changes logging from
      
      ```
      14/09/05 02:01:08 INFO SecurityManager: Changing view acls to: aash,
      14/09/05 02:01:08 INFO SecurityManager: Changing modify acls to: aash,
      14/09/05 02:01:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(aash, ); users with modify permissions: Set(aash, )
      ```
      to
      ```
      14/09/05 02:28:28 INFO SecurityManager: Changing view acls to: aash
      14/09/05 02:28:28 INFO SecurityManager: Changing modify acls to: aash
      14/09/05 02:28:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(aash); users with modify permissions: Set(aash)
      ```
      
      Note that the first set of logs has a Set of size 2, containing "aash" and the empty string "".
      
      cc tgravescs
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #2286 from ash211/empty-default-acl and squashes the following commits:
      
      18cc612 [Andrew Ash] Use .isEmpty instead of ==""
      cf973a1 [Andrew Ash] Don't include the empty string "" as a defaultAclUser
      ce59725b
    • [SPARK-3490] Disable SparkUI for tests · 6324eb7b
      Andrew Or authored
      We currently open many ephemeral ports during the tests, and as a result we occasionally can't bind to new ones. This has caused the `DriverSuite` and the `SparkSubmitSuite` to fail intermittently.
      
      By disabling the `SparkUI` when it's not needed, we already cut down on the number of ports opened significantly, on the order of the number of `SparkContexts` ever created. We must keep it enabled for a few tests for the UI itself, however.
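      
      A hedged sketch of a UI-less test context (assumes the `spark.ui.enabled` flag this change keys off):
      
      ```python
      from pyspark import SparkConf, SparkContext
      
      conf = SparkConf().setMaster("local[2]").set("spark.ui.enabled", "false")
      sc = SparkContext(conf=conf)  # no ephemeral UI port is bound
      ```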
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2363 from andrewor14/disable-ui-for-tests and squashes the following commits:
      
      332a7d5 [Andrew Or] No need to set spark.ui.port to 0 anymore
      30c93a2 [Andrew Or] Simplify streaming UISuite
      a431b84 [Andrew Or] Fix streaming test failures
      8f5ae53 [Andrew Or] Fix no new line at the end
      29c9b5b [Andrew Or] Disable SparkUI for tests
      6324eb7b