  1. Sep 17, 2014
    • Nicholas Chammas's avatar
      [Docs] minor grammar fix · 8fbd5f4a
      Nicholas Chammas authored
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2430 from nchammas/patch-2 and squashes the following commits:
      
      d476bfb [Nicholas Chammas] [Docs] minor grammar fix
      8fbd5f4a
    • chesterxgchen's avatar
      SPARK-3177 (on Master Branch) · 7d1a3723
      chesterxgchen authored
The JIRA and PR were originally created for branch-1.1, and have now been moved to the master branch.
      Chester
      
The issue is that yarn-alpha and yarn have different APIs for certain class fields. In this particular case, ClientBase uses reflection to address this, and we need a different way to test ClientBase's method. The original ClientBaseSuite used the getFieldValue() method to do this, but that doesn't work for yarn-alpha, because its API returns an array of String instead of just a String (which is what the yarn-stable API returns).
      
To fix the test, I added a new method:

```
def getFieldValue2[A: ClassTag, A1: ClassTag, B](
    clazz: Class[_], field: String, defaults: => B)
    (mapTo: A => B)(mapTo1: A1 => B): B =
  Try(clazz.getField(field)).map(_.get(null)).map {
    case v: A   => mapTo(v)
    case v1: A1 => mapTo1(v1)
    case _      => defaults
  }.toOption.getOrElse(defaults)
```
      
to handle cases where the field type can be either A or A1. In this new method the value is pattern matched against types A and A1, and the corresponding mapping function (mapTo or mapTo1) is applied.
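
A call site might then look like the following sketch (the target class and field name are placeholders, not taken from the actual test), mapping either a `String` or an `Array[String]` field to a single classpath string:

```
// Sketch only, assuming getFieldValue2 above is in scope and `confClass` is the
// Class[_] whose field shape differs between yarn-alpha and yarn-stable.
val classpath: String = getFieldValue2[String, Array[String], String](
  confClass, "DEFAULT_YARN_APPLICATION_CLASSPATH", defaults = "")(
  s => s)(                                          // yarn-stable: field is a String
  arr => arr.mkString(java.io.File.pathSeparator))  // yarn-alpha: field is an Array[String]
```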
      
      Author: chesterxgchen <chester@alpinenow.com>
      
      Closes #2204 from chesterxgchen/SPARK-3177-master and squashes the following commits:
      
      e72a6ea [chesterxgchen]  The Issue is due to that yarn-alpha and yarn have different APIs for certain class fields. In this particular case,  the ClientBase using reflection to to address this issue, and we need to different way to test the ClientBase's method.  Original ClientBaseSuite using getFieldValue() method to do this. But it doesn't work for yarn-alpha as the API returns an array of String instead of just String (which is the case for Yarn-stable API).
      7d1a3723
    • viper-kun's avatar
      [Docs] Correct spark.files.fetchTimeout default value · 983609a4
      viper-kun authored
      change the value of spark.files.fetchTimeout
      
      Author: viper-kun <xukun.xu@huawei.com>
      
      Closes #2406 from viper-kun/master and squashes the following commits:
      
      ecb0d46 [viper-kun] [Docs] Correct spark.files.fetchTimeout default value
      7cf4c7a [viper-kun] Update configuration.md
      983609a4
  2. Sep 16, 2014
    • wangfei's avatar
      [Minor]ignore all config files in conf · 008a5ed4
      wangfei authored
Some config files in ```conf``` should be ignored, such as:
              conf/fairscheduler.xml
              conf/hive-log4j.properties
              conf/metrics.properties
      ...
      So ignore all ```sh```/```properties```/```conf```/```xml``` files
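
The resulting ignore rules would look roughly like this (a sketch of the pattern form, not a copy of the actual `.gitignore` diff):

```
conf/*.sh
conf/*.properties
conf/*.conf
conf/*.xml
```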
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #2395 from scwf/patch-2 and squashes the following commits:
      
      3dc53f2 [wangfei] duplicate ```conf/*.conf```
      3c2986f [wangfei] ignore all config files
      008a5ed4
    • Andrew Or's avatar
      [SPARK-3555] Fix UISuite race condition · 0a7091e6
      Andrew Or authored
      The test "jetty selects different port under contention" is flaky.
      
If another process binds to 4040 before the test starts, then the first server we start there will fail, and the subsequent servers we start may themselves bind to 4040 if it is released in the meantime. Instead, we should just let Java find a random free port for us and hold onto it for the duration of the test.
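
A minimal sketch of the "let Java pick" idea (not the exact test code): binding a `ServerSocket` to port 0 asks the OS for a free ephemeral port.

```
import java.net.ServerSocket

// Bind to port 0 so the OS chooses a free ephemeral port.
val socket = new ServerSocket(0)
val freePort = socket.getLocalPort
// Use `freePort` as the test's starting port instead of the well-known 4040,
// which some other process may already have taken.
socket.close()
```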
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2418 from andrewor14/fix-port-contention and squashes the following commits:
      
      0cd4974 [Andrew Or] Stop them servers
      a7071fe [Andrew Or] Pick random port instead of 4040
      0a7091e6
    • Evan Chan's avatar
      Add a Community Projects page · a6e1712f
      Evan Chan authored
      This adds a new page to the docs listing community projects -- those created outside of Apache Spark that are of interest to the community of Spark users.   Anybody can add to it just by submitting a PR.
      
      There was a discussion thread about alternatives:
      * Creating a Github organization for Spark projects -  we could not find any sponsors for this, and it would be difficult to organize since many folks just create repos in their company organization or personal accounts
* Apache has some place for storing community projects, but it was deemed difficult to work with, and again there would be permissions issues -- not everyone could update it.
      
      Author: Evan Chan <velvia@gmail.com>
      
      Closes #2219 from velvia/community-projects-page and squashes the following commits:
      
      7316822 [Evan Chan] Point to Spark wiki: supplemental projects page
      613b021 [Evan Chan] Add a few more projects
      a85eaaf [Evan Chan] Add a Community Projects page
      a6e1712f
    • Dan Osipov's avatar
      [SPARK-787] Add S3 configuration parameters to the EC2 deploy scripts · b2017126
      Dan Osipov authored
When deploying to AWS, additional configuration is required to read S3 files. EMR creates it automatically, and there is no reason the Spark EC2 script shouldn't do the same.
      
      This PR requires a corresponding PR to the mesos/spark-ec2 to be merged, as it gets cloned in the process of setting up machines: https://github.com/mesos/spark-ec2/pull/58
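
With that in place, usage might look roughly like the following (key pair, slave count, and cluster name are placeholders):

```
./spark-ec2 -k my-keypair -i my-keypair.pem -s 2 --copy-aws-credentials launch my-cluster
```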
      
      Author: Dan Osipov <daniil.osipov@shazam.com>
      
      Closes #1120 from danosipov/s3_credentials and squashes the following commits:
      
      758da8b [Dan Osipov] Modify documentation to include the new parameter
      71fab14 [Dan Osipov] Use a parameter --copy-aws-credentials to enable S3 credential deployment
      7e0da26 [Dan Osipov] Get AWS credentials out of boto connection instance
      39bdf30 [Dan Osipov] Add S3 configuration parameters to the EC2 deploy scripts
      b2017126
    • Davies Liu's avatar
      [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx · ec1adecb
      Davies Liu authored
      Using Sphinx to generate API docs for PySpark.
      
      requirement: Sphinx
      
      ```
      $ cd python/docs/
      $ make html
      ```
      
      The generated API docs will be located at python/docs/_build/html/index.html
      
It can co-exist with the docs generated by Epydoc.

This is a first working version; after it is merged in, we can continue improving it and eventually replace Epydoc.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2292 from davies/sphinx and squashes the following commits:
      
      425a3b1 [Davies Liu] cleanup
      1573298 [Davies Liu] move docs to python/docs/
      5fe3903 [Davies Liu] Merge branch 'master' into sphinx
      9468ab0 [Davies Liu] fix makefile
      b408f38 [Davies Liu] address all comments
      e2ccb1b [Davies Liu] update name and version
      9081ead [Davies Liu] generate PySpark API docs using Sphinx
      ec1adecb
    • Kousuke Saruta's avatar
      [SPARK-3546] InputStream of ManagedBuffer is not closed and causes running out of file descriptor · a9e91043
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2408 from sarutak/resolve-resource-leak-issue and squashes the following commits:
      
      074781d [Kousuke Saruta] Modified SuffleBlockFetcherIterator
      5f63f67 [Kousuke Saruta] Move metrics increment logic and debug logging outside try block
      b37231a [Kousuke Saruta] Modified FileSegmentManagedBuffer#nioByteBuffer to check null or not before invoking channel.close
      bf29d4a [Kousuke Saruta] Modified FileSegment to close channel
      a9e91043
    • Michael Armbrust's avatar
      [SQL][DOCS] Improve section on thrift-server · 84073eb1
      Michael Armbrust authored
      Taken from liancheng's updates. Merged conflicts with #2316.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2384 from marmbrus/sqlDocUpdate and squashes the following commits:
      
      2db6319 [Michael Armbrust] @liancheng's updates
      84073eb1
    • Nicholas Chammas's avatar
      [Docs] minor punctuation fix · df90e81f
      Nicholas Chammas authored
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2414 from nchammas/patch-1 and squashes the following commits:
      
      14664bf [Nicholas Chammas] [Docs] minor punctuation fix
      df90e81f
    • Aaron Staple's avatar
      [SPARK-2314][SQL] Override collect and take in python library, and count in... · 8e7ae477
      Aaron Staple authored
      [SPARK-2314][SQL] Override collect and take in python library, and count in java library, with optimized versions.
      
      SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions making use of the query optimizer.  The java and python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well.  This patch overrides relevant calls in the python and java interfaces with optimized versions.
      
      Adds a new Row serialization pathway between python and java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn’t overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to python using the existing serialization code. To me it made sense to ship the Array[Row] over to python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to python using the existing serialization code.
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #1592 from staple/SPARK-2314 and squashes the following commits:
      
      89ff550 [Aaron Staple] Merge with master.
      6bb7b6c [Aaron Staple] Fix typo.
      b56d0ac [Aaron Staple] [SPARK-2314][SQL] Override count in JavaSchemaRDD, forwarding to SchemaRDD's count.
      0fc9d40 [Aaron Staple] Fix comment typos.
      f03cdfa [Aaron Staple] [SPARK-2314][SQL] Override collect and take in sql.py, forwarding to SchemaRDD's collect.
      8e7ae477
    • Michael Armbrust's avatar
      [SPARK-2890][SQL] Allow reading of data when case insensitive resolution could... · 30f288ae
      Michael Armbrust authored
      [SPARK-2890][SQL] Allow reading of data when case insensitive resolution could cause possible ambiguity.
      
Throwing an error in the constructor makes it impossible to run queries, even when there is no actual ambiguity. Remove this check in favor of throwing an error in analysis when the query actually is ambiguous.
      
      Also took the opportunity to add test cases that would have caught a subtle bug in my first attempt at fixing this and refactor some other test code.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2209 from marmbrus/sameNameStruct and squashes the following commits:
      
      729cca4 [Michael Armbrust] Better tests.
      a003aeb [Michael Armbrust] Remove error (it'll be caught in analysis).
      30f288ae
    • Yin Huai's avatar
      [SPARK-3308][SQL] Ability to read JSON Arrays as tables · 75836998
      Yin Huai authored
This PR aims to support reading top-level JSON arrays, taking every element of such an array as a row (an empty array will not generate a row).
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-3308
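
A sketch of what this enables (the data is made up; `sc` is assumed to be an existing SparkContext, as in the Spark shell): a top-level array produces one row per element, and an empty array produces none.

```
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val json = sc.parallelize(Seq(
  """[{"name": "a"}, {"name": "b"}]""",  // top-level array -> 2 rows
  """[]"""))                             // empty array -> 0 rows
val table = sqlContext.jsonRDD(json)
table.count()  // 2
```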
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #2400 from yhuai/SPARK-3308 and squashes the following commits:
      
      990077a [Yin Huai] Handle top level JSON arrays.
      75836998
    • Matthew Farrellee's avatar
      [SPARK-3519] add distinct(n) to PySpark · 9d5fa763
      Matthew Farrellee authored
      Added missing rdd.distinct(numPartitions) and associated tests
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2383 from mattf/SPARK-3519 and squashes the following commits:
      
      30b837a [Matthew Farrellee] Combine test cases to save on JVM startups
      6bc4a2c [Matthew Farrellee] [SPARK-3519] add distinct(n) to SchemaRDD in PySpark
      7a17f2b [Matthew Farrellee] [SPARK-3519] add distinct(n) to PySpark
      9d5fa763
    • Cheng Hao's avatar
      [SPARK-3527] [SQL] Strip the string message · 86d253ec
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2392 from chenghao-intel/trim and squashes the following commits:
      
      e52024f [Cheng Hao] trim the string message
      86d253ec
    • Prashant Sharma's avatar
      [SPARK-2182] Scalastyle rule blocking non ascii characters. · 7b8008f5
      Prashant Sharma authored
      ...erators.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2358 from ScrapCodes/scalastyle-unicode and squashes the following commits:
      
      12a20f2 [Prashant Sharma] [SPARK-2182] Scalastyle rule blocking (non keyboard typeable) unicode operators.
      7b8008f5
    • Sean Owen's avatar
      SPARK-3069 [DOCS] Build instructions in README are outdated · 61e21fe7
      Sean Owen authored
      Here's my crack at Bertrand's suggestion. The Github `README.md` contains build info that's outdated. It should just point to the current online docs, and reflect that Maven is the primary build now.
      
      (Incidentally, the stanza at the end about contributions of original work should go in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark too. It won't hurt to be crystal clear about the agreement to license, given that ICLAs are not required of anyone here.)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2014 from srowen/SPARK-3069 and squashes the following commits:
      
      501507e [Sean Owen] Note that Zinc is for Maven builds too
      db2bd97 [Sean Owen] sbt -> sbt/sbt and add note about zinc
      be82027 [Sean Owen] Fix additional occurrences of building-with-maven -> building-spark
      91c921f [Sean Owen] Move building-with-maven to building-spark and create a redirect. Update doc links to building-spark.html Add jekyll-redirect-from plugin and make associated config changes (including fixing pygments deprecation). Add example of SBT to README.md
      999544e [Sean Owen] Change "Building Spark with Maven" title to "Building Spark"; reinstate tl;dr info about dev/run-tests in README.md; add brief note about building with SBT
      c18d140 [Sean Owen] Optionally, remove the copy of contributing text from main README.md
      8e83934 [Sean Owen] Add CONTRIBUTING.md to trigger notice on new pull request page
      b1c04a1 [Sean Owen] Refer to current online documentation for building, and remove slightly outdated copy in README.md
      61e21fe7
  3. Sep 15, 2014
    • Ye Xianjin's avatar
      [SPARK-3040] pick up a more proper local ip address for Utils.findLocalIpAddress method · febafefa
      Ye Xianjin authored
Short version: NetworkInterface.getNetworkInterfaces returns interfaces in reverse order compared to ifconfig output. It may pick up an IP address associated with tun0 or a virtual network interface.
See [SPARK-3040](https://issues.apache.org/jira/browse/SPARK-3040) for more details.
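
A simplified sketch of the idea (not the actual Utils code, which among other things also adds Windows support):

```
import java.net.{Inet4Address, NetworkInterface}
import scala.collection.JavaConverters._

// Reverse the getNetworkInterfaces order (the opposite of ifconfig's) and take
// the first non-loopback IPv4 address as the local IP.
val localIp = NetworkInterface.getNetworkInterfaces.asScala.toSeq.reverse.iterator
  .flatMap(_.getInetAddresses.asScala)
  .find(addr => !addr.isLoopbackAddress && addr.isInstanceOf[Inet4Address])
  .map(_.getHostAddress)
```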
      
      Author: Ye Xianjin <advancedxy@gmail.com>
      
      Closes #1946 from advancedxy/SPARK-3040 and squashes the following commits:
      
      f33f6b2 [Ye Xianjin] add windows support
      087a785 [Ye Xianjin] reverse the Networkinterface.getNetworkInterfaces output order to get a more proper local ip address.
      febafefa
    • Prashant Sharma's avatar
      [SPARK-3433][BUILD] Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations. · ecf0c029
      Prashant Sharma authored
Actually, the reported false positive was due to the MiMa generator not picking up the new jars in the presence of old jars (theoretically this should not have happened). So as a workaround, we run them both separately and just append the outputs together.
      
      Author: Prashant Sharma <prashant@apache.org>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2285 from ScrapCodes/mima-fix and squashes the following commits:
      
      093c76f [Prashant Sharma] Update mima
      59012a8 [Prashant Sharma] Update mima
      35b6c71 [Prashant Sharma] SPARK-3433 Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations.
      ecf0c029
    • Reynold Xin's avatar
      [SPARK-3540] Add reboot-slaves functionality to the ec2 script · d428ac6a
      Reynold Xin authored
      Tested on a real cluster.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2404 from rxin/ec2-reboot-slaves and squashes the following commits:
      
      00a2dbd [Reynold Xin] Allow rebooting slaves.
      d428ac6a
    • Aaron Staple's avatar
      [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. · 60050f42
      Aaron Staple authored
      Also made some cosmetic cleanups.
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #2385 from staple/SPARK-1087 and squashes the following commits:
      
      7b3bb13 [Aaron Staple] Address review comments, cosmetic cleanups.
      10ba6e1 [Aaron Staple] [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
      60050f42
    • Davies Liu's avatar
      [SPARK-2951] [PySpark] support unpickle array.array for Python 2.6 · da33acb8
      Davies Liu authored
Pyrolite cannot unpickle an array.array that was pickled by Python 2.6; this patch fixes it by extending Pyrolite.

There is a bug in Pyrolite when unpickling arrays of float/double; this patch works around it by reversing the endianness for float/double. The workaround should be removed once Pyrolite ships a release that fixes the issue.

I have sent a PR to Pyrolite to fix it: https://github.com/irmen/Pyrolite/pull/11
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2365 from davies/pickle and squashes the following commits:
      
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      da33acb8
    • qiping.lqp's avatar
      [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params... · fdb302f4
      qiping.lqp authored
      [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API
      
      Added minInstancesPerNode, minInfoGain params to:
      * DecisionTreeRunner.scala example
      * Python API (tree.py)
      
      Also:
      * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
      * small style fixes
      
      CC: mengxr
      
      Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      Author: chouqin <liqiping1991@gmail.com>
      
      Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:
      
      61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
      a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
      e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
      f1d11d1 [chouqin] fix typo
      c7ebaf1 [chouqin] fix typo
      39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
      c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
      0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
      d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
      efcc736 [qiping.lqp] fix bug
      10b8012 [qiping.lqp] fix style
      6728fad [qiping.lqp] minor fix: remove empty lines
      bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
      cadd569 [qiping.lqp] add api docs
      46b891f [qiping.lqp] fix bug
      e72c7e4 [qiping.lqp] add comments
      845c6fa [qiping.lqp] fix style
      f195e83 [qiping.lqp] fix style
      987cbf4 [qiping.lqp] fix bug
      ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
      ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
      fdb302f4
    • Reza Zadeh's avatar
      [MLlib] Update SVD documentation in IndexedRowMatrix · 983d6a9c
      Reza Zadeh authored
      Updating this to reflect the newest SVD via ARPACK
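
For reference, a usage sketch of the documented API (assuming an existing `IndexedRowMatrix` named `mat`):

```
// Compute the top 20 singular values/vectors; U is only computed when requested.
val svd = mat.computeSVD(k = 20, computeU = true)
val U = svd.U  // IndexedRowMatrix of left singular vectors
val s = svd.s  // Vector of singular values, in descending order
val V = svd.V  // local dense Matrix of right singular vectors
```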
      
      Author: Reza Zadeh <rizlar@gmail.com>
      
      Closes #2389 from rezazadeh/irmdocs and squashes the following commits:
      
      7fa1313 [Reza Zadeh] Update svd docs
      715da25 [Reza Zadeh] Updated computeSVD documentation IndexedRowMatrix
      983d6a9c
    • Christoph Sawade's avatar
      [SPARK-3396][MLLIB] Use SquaredL2Updater in LogisticRegressionWithSGD · 3b931281
      Christoph Sawade authored
SimpleUpdater ignores the regularizer, which leads to an unregularized
LogReg. To enable the common L2 regularizer (and the corresponding
regularization parameter) for logistic regression, the SquaredL2Updater
has to be used in SGD (see, e.g., [SVMWithSGD]).
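
For illustration, the same effect can be obtained by configuring the optimizer explicitly (a sketch, not the PR's internal change; `trainingData` is an assumed, pre-existing `RDD[LabeledPoint]`):

```
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)                 // now actually applied by the updater
  .setUpdater(new SquaredL2Updater)
val model = lr.run(trainingData)
```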
      
      Author: Christoph Sawade <christoph@sawade.me>
      
      Closes #2398 from BigCrunsh/fix-regparam-logreg and squashes the following commits:
      
      0820c04 [Christoph Sawade] Use SquaredL2Updater in LogisticRegressionWithSGD
      3b931281
    • yantangzhai's avatar
      [SPARK-2714] DAGScheduler logs jobid when runJob finishes · 37d92528
      yantangzhai authored
      DAGScheduler logs jobid when runJob finishes
      
      Author: yantangzhai <tyz0303@163.com>
      
      Closes #1617 from YanTangZhai/SPARK-2714 and squashes the following commits:
      
      0a0243f [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      fbb1150 [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      7aec2a9 [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      fb42f0f [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      090d908 [yantangzhai] [SPARK-2714] DAGScheduler logs jobid when runJob finishes
      37d92528
    • Kousuke Saruta's avatar
      [SPARK-3518] Remove wasted statement in JsonProtocol · e59fac1f
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2380 from sarutak/SPARK-3518 and squashes the following commits:
      
      8a1464e [Kousuke Saruta] Replaced a variable with simple field reference
      c660fbc [Kousuke Saruta] Removed useless statement in JsonProtocol.scala
      e59fac1f
    • Matthew Farrellee's avatar
      [SPARK-3425] do not set MaxPermSize for OpenJDK 1.8 · fe2b1d6a
      Matthew Farrellee authored
      Closes #2387
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2301 from mattf/SPARK-3425 and squashes the following commits:
      
      20f3c09 [Matthew Farrellee] [SPARK-3425] do not set MaxPermSize for OpenJDK 1.8
      fe2b1d6a
    • Kousuke Saruta's avatar
      [SPARK-3410] The priority of shutdownhook for ApplicationMaster should not be integer literal · cc146444
      Kousuke Saruta authored
I think we need to keep the priority of the shutdown hook for ApplicationMaster higher than the priority of the shutdown hook for o.a.h.FileSystem, by defining it relative to FileSystem's priority rather than as an integer literal, so it is not broken by changes to FileSystem's priority.
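
A sketch of the idea (not the exact patch; the offset is arbitrary): define the AM hook's priority relative to FileSystem's constant instead of hard-coding a literal, and register it through Hadoop's ShutdownHookManager.

```
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Keep the AM hook above FileSystem's, whatever FileSystem's priority is.
val amShutdownPriority = FileSystem.SHUTDOWN_HOOK_PRIORITY + 10

ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // ... ApplicationMaster cleanup (unregister from the RM, delete staging dir, ...) ...
  }
}, amShutdownPriority)
```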
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2283 from sarutak/SPARK-3410 and squashes the following commits:
      
      1d44fef [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3410
      bd6cc53 [Kousuke Saruta] Modified style
      ee6f1aa [Kousuke Saruta] Added constant "SHUTDOWN_HOOK_PRIORITY" to ApplicationMaster
      54eb68f [Kousuke Saruta] Changed Shutdown hook priority to 20
      2f0aee3 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3410
      4c5cb93 [Kousuke Saruta] Modified the priority for AM's shutdown hook
      217d1a4 [Kousuke Saruta] Removed unused import statements
      717aba2 [Kousuke Saruta] Modified ApplicationMaster to make to keep the priority of shutdown hook for ApplicationMaster higher than the priority of shutdown hook for HDFS
      cc146444
  4. Sep 14, 2014
    • Prashant Sharma's avatar
      [SPARK-3452] Maven build should skip publishing artifacts people shouldn... · f493f798
      Prashant Sharma authored
      ...'t depend on
      
In Maven terms, publishing locally is `install`, and publishing to a repository is `deploy`.

So both are disabled for the affected projects.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2329 from ScrapCodes/SPARK-3452/maven-skip-install and squashes the following commits:
      
      257b79a [Prashant Sharma] [SPARK-3452] Maven build should skip publishing artifacts people shouldn't depend on
      f493f798
    • Bertrand Bossy's avatar
      SPARK-3039: Allow spark to be built using avro-mapred for hadoop2 · c243b21a
      Bertrand Bossy authored
      SPARK-3039: Adds the maven property "avro.mapred.classifier" to build spark-assembly with avro-mapred with support for the new Hadoop API. Sets this property to hadoop2 for Hadoop 2 profiles.
      
      I am not very familiar with maven, nor do I know whether this potentially breaks something in the hive part of spark. There might be a more elegant way of doing this.
      
      Author: Bertrand Bossy <bertrandbossy@gmail.com>
      
      Closes #1945 from bbossy/SPARK-3039 and squashes the following commits:
      
      c32ce59 [Bertrand Bossy] SPARK-3039: Allow spark to be built using avro-mapred for hadoop2
      c243b21a
    • Davies Liu's avatar
      [SPARK-3463] [PySpark] aggregate and show spilled bytes in Python · 4e3fbe8c
      Davies Liu authored
Aggregate the number of bytes spilled to disk during aggregation or sorting, and show them in the Web UI.
      
      ![spilled](https://cloud.githubusercontent.com/assets/40902/4209758/4b995562-386d-11e4-97c1-8e838ee1d4e3.png)
      
      This patch is blocked by SPARK-3465. (It includes a fix for that).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2336 from davies/metrics and squashes the following commits:
      
      e37df38 [Davies Liu] remove outdated comments
      1245eb7 [Davies Liu] remove the temporary fix
      ebd2f43 [Davies Liu] Merge branch 'master' into metrics
      7e4ad04 [Davies Liu] Merge branch 'master' into metrics
      fbe9029 [Davies Liu] show spilled bytes in Python in web ui
      4e3fbe8c
  5. Sep 13, 2014
    • Davies Liu's avatar
      [SPARK-3030] [PySpark] Reuse Python worker · 2aea0da8
      Davies Liu authored
Reuse Python workers to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts for each worker, avoiding repeated sends of the same broadcast.

This can reduce the time for a dummy task from 22ms to 13ms (-40%). It can help reduce latency for Spark Streaming.
      
      For a job with broadcast (43M after compress):
      ```
          b = sc.broadcast(set(range(30000000)))
          print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
      ```
It finishes in 281s without worker reuse, and in 65s with worker reuse (4 CPUs). Reusing the worker saves about 9 seconds of broadcast transfer and deserialization per task.

It's enabled by default, and can be disabled with `spark.python.worker.reuse = false`.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2259 from davies/reuse-worker and squashes the following commits:
      
      f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
      3939f20 [Davies Liu] fix bug in serializer in mllib
      cf1c55e [Davies Liu] address comments
      3133a60 [Davies Liu] fix accumulator with reused worker
      760ab1f [Davies Liu] do not reuse worker if there are any exceptions
      7abb224 [Davies Liu] refactor: sychronized with itself
      ac3206e [Davies Liu] renaming
      8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
      6325fc1 [Davies Liu] bugfix: bid >= 0
      e0131a2 [Davies Liu] fix name of config
      583716e [Davies Liu] only reuse completed and not interrupted worker
      ace2917 [Davies Liu] kill python worker after timeout
      6123d0f [Davies Liu] track broadcasts for each worker
      8d2f08c [Davies Liu] reuse python worker
      2aea0da8
    • Michael Armbrust's avatar
      [SQL] Decrease partitions when testing · 0f8c4edf
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2164 from marmbrus/shufflePartitions and squashes the following commits:
      
      0da1e8c [Michael Armbrust] test hax
      ef2d985 [Michael Armbrust] more test hacks.
      2dabae3 [Michael Armbrust] more test fixes
      0bdbf21 [Michael Armbrust] Make parquet tests less order dependent
      b42eeab [Michael Armbrust] increase test parallelism
      80453d5 [Michael Armbrust] Decrease partitions when testing
      0f8c4edf
    • Cheng Lian's avatar
      [SPARK-3294][SQL] Eliminates boxing costs from in-memory columnar storage · 74049249
      Cheng Lian authored
This is a major refactoring of the in-memory columnar storage implementation, aiming to eliminate boxing costs from critical paths (building/accessing column buffers) as much as possible. The basic idea is to refactor all major interfaces into a row-based form and use them together with `SpecificMutableRow`. The difficult part is adapting all compression schemes, esp. `RunLengthEncoding` and `DictionaryEncoding`, to this design. Since in-memory compression is disabled by default for now, and this PR should be strictly better than before whether or not in-memory compression is enabled, maybe I'll finish that part in another PR.
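
A toy illustration of the row-based, type-specific idea (not the actual Spark SQL classes): exposing a primitive path avoids the `Int -> java.lang.Integer` boxing that a generic `Any`-based append incurs.

```
import java.nio.ByteBuffer

// Generic path: every append boxes the primitive into an AnyRef.
final class BoxedIntColumn(capacity: Int) {
  private val buffer = ByteBuffer.allocate(capacity * 4)
  def append(value: Any): Unit = buffer.putInt(value.asInstanceOf[Int]) // boxed at the call site
}

// Row-based, type-specific path: a mutable field holder with a primitive getter,
// so no wrapper object is created per value.
final class MutableIntField { var value: Int = 0 }
final class SpecificIntColumn(capacity: Int) {
  private val buffer = ByteBuffer.allocate(capacity * 4)
  def append(field: MutableIntField): Unit = buffer.putInt(field.value) // primitive all the way
}
```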
      
      **UPDATE** This PR also took the chance to optimize `HiveTableScan` by
      
      1. leveraging `SpecificMutableRow` to avoid boxing cost, and
1. building specific `Writable` unwrapper functions ahead of time to avoid per-row pattern matching and branching costs.
      
      TODO
      
      - [x] Benchmark
      - [ ] ~~Eliminate boxing costs in `RunLengthEncoding`~~ (left to future PRs)
      - [ ] ~~Eliminate boxing costs in `DictionaryEncoding` (seems not easy to do without specializing `DictionaryEncoding` for every supported column type)~~  (left to future PRs)
      
      ## Micro benchmark
      
The benchmark uses a 10-million-line CSV table consisting of bytes, shorts, integers, longs, floats and doubles, and measures the time to build the in-memory version of this table and the time to scan the whole in-memory table.
      
      Benchmark code can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-hivetablescanbenchmark-scala). Script used to generate the input table can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-tablegen-scala).
      
      Speedup:
      
      - Hive table scanning + column buffer building: **18.74%**
      
  The original benchmark uses 1K as the in-memory batch size; when increased to 10K, it can be 28.32% faster.
      
      - In-memory table scanning: **7.95%**
      
      Before:
      
Run     | Building | Scanning
      ------- | -------- | --------
      1       | 16472    | 525
      2       | 16168    | 530
      3       | 16386    | 529
      4       | 16184    | 538
      5       | 16209    | 521
      Average | 16283.8  | 528.6
      
      After:
      
Run     | Building | Scanning
      ------- | -------- | --------
      1       | 13124    | 458
      2       | 13260    | 529
      3       | 12981    | 463
      4       | 13214    | 483
      5       | 13583    | 500
      Average | 13232.4  | 486.6
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2327 from liancheng/prevent-boxing/unboxing and squashes the following commits:
      
      4419fe4 [Cheng Lian] Addressing comments
      e5d2cf2 [Cheng Lian] Bug fix: should call setNullAt when field value is null to avoid NPE
      8b8552b [Cheng Lian] Only checks for partition batch pruning flag once
      489f97b [Cheng Lian] Bug fix: TableReader.fillObject uses wrong ordinals
      97bbc4e [Cheng Lian] Optimizes hive.TableReader by by providing specific Writable unwrappers a head of time
      3dc1f94 [Cheng Lian] Minor changes to eliminate row object creation
      5b39cb9 [Cheng Lian] Lowers log level of compression scheme details
      f2a7890 [Cheng Lian] Use SpecificMutableRow in InMemoryColumnarTableScan to avoid boxing
      9cf30b0 [Cheng Lian] Added row based ColumnType.append/extract
      456c366 [Cheng Lian] Made compression decoder row based
      edac3cd [Cheng Lian] Makes ColumnAccessor.extractSingle row based
      8216936 [Cheng Lian] Removes boxing cost in IntDelta and LongDelta by providing specialized implementations
      b70d519 [Cheng Lian] Made some in-memory columnar storage interfaces row-based
      74049249
    • Cheng Lian's avatar
      [SPARK-3481][SQL] Removes the evil MINOR HACK · 184cd51c
      Cheng Lian authored
      This is a follow up of #2352. Now we can finally remove the evil "MINOR HACK", which covered up the eldest bug in the history of Spark SQL (see details [here](https://github.com/apache/spark/pull/2352#issuecomment-55440621)).
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2377 from liancheng/remove-evil-minor-hack and squashes the following commits:
      
      0869c78 [Cheng Lian] Removes the evil MINOR HACK
      184cd51c
    • Nicholas Chammas's avatar
      [SQL] [Docs] typo fixes · a523ceaf
      Nicholas Chammas authored
      * Fixed random typo
      * Added in missing description for DecimalType
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2367 from nchammas/patch-1 and squashes the following commits:
      
      aa528be [Nicholas Chammas] doc fix for SQL DecimalType
      3247ac1 [Nicholas Chammas] [SQL] [Docs] typo fixes
      a523ceaf
    • Reynold Xin's avatar
      Proper indent for the previous commit. · b4dded40
      Reynold Xin authored
      b4dded40
    • Sean Owen's avatar
      SPARK-3470 [CORE] [STREAMING] Add Closeable / close() to Java context objects · feaa3706
      Sean Owen authored
      ...  that expose a stop() lifecycle method. This doesn't add `AutoCloseable`, which is Java 7+ only. But it should be possible to use try-with-resources on a `Closeable` in Java 7, as long as the `close()` does not throw a checked exception, and these don't. Q.E.D.
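
A small sketch of what this enables (Scala here rather than Java, and the loan helper is illustrative): since `close()` now delegates to `stop()`, the Java context objects can be managed like any other `Closeable`.

```
import java.io.Closeable

// Generic loan pattern: the resource is always closed, mirroring Java 7's
// try-with-resources for a Closeable whose close() throws no checked exception.
def withCloseable[C <: Closeable, T](resource: C)(body: C => T): T =
  try body(resource) finally resource.close()

// e.g. withCloseable(new JavaSparkContext(conf)) { jsc => jsc.textFile("...").count() }
```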
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2346 from srowen/SPARK-3470 and squashes the following commits:
      
      612c21d [Sean Owen] Add Closeable / close() to Java context objects that expose a stop() lifecycle method
      feaa3706