  1. Aug 24, 2014
  2. Aug 23, 2014
    • Clean unused code in SortShuffleWriter · 8861cdf1
      Raymond Liu authored
      Just clean up unused code that has been moved into ExternalSorter.
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1882 from colorant/sortShuffleWriter and squashes the following commits:
      
      e6337be [Raymond Liu] Clean unused code in SortShuffleWriter
    • [SPARK-2871] [PySpark] add approx API for RDD · 8df4dad4
      Davies Liu authored
      RDD.countApprox(self, timeout, confidence=0.95)
      
              :: Experimental ::
              Approximate version of count() that returns a potentially incomplete
              result within a timeout, even if not all tasks have finished.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> rdd.countApprox(1000, 1.0)
              1000
      
      RDD.sumApprox(self, timeout, confidence=0.95)
      
              Approximate operation to return the sum within a timeout
              or meet the confidence.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> r = sum(xrange(1000))
              >>> (rdd.sumApprox(1000) - r) / r < 0.05
              True
      
      RDD.meanApprox(self, timeout, confidence=0.95)
      
              :: Experimental ::
              Approximate operation to return the mean within a timeout
              or meet the confidence.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> r = sum(xrange(1000)) / 1000.0
              >>> (rdd.meanApprox(1000) - r) / r < 0.05
              True
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2095 from davies/approx and squashes the following commits:
      
      e8c252b [Davies Liu] add approx API for RDD
    • [SPARK-2871] [PySpark] add `key` argument for max(), min() and top(n) · db436e36
      Davies Liu authored
      RDD.max(key=None)
      
              param key: A function used to generate the key for comparison
      
              >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
              >>> rdd.max()
              43.0
              >>> rdd.max(key=str)
              5.0
      
      RDD.min(key=None)
      
              Find the minimum item in this RDD.
      
              param key: A function used to generate the key for comparison
      
              >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
              >>> rdd.min()
              2.0
              >>> rdd.min(key=str)
              10.0
      
      RDD.top(num, key=None)
      
              Get the top N elements from an RDD.
      
              Note: It returns the list sorted in descending order.

              >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
              [12]
              >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
              [6, 5]
              >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
              [4, 3, 2]
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2094 from davies/cmp and squashes the following commits:
      
      ccbaf25 [Davies Liu] add `key` to top()
      ad7e374 [Davies Liu] fix tests
      2f63512 [Davies Liu] change `comp` to `key` in min/max
      dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
    • [SPARK-2967][SQL] Follow-up: Also copy hash expressions in sort based shuffle fix. · 3519b5e8
      Michael Armbrust authored
      Follow-up to #2066
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2072 from marmbrus/sortShuffle and squashes the following commits:
      
      2ff8114 [Michael Armbrust] Fix bug
    • [SPARK-2554][SQL] CountDistinct partial aggregation and object allocation improvements · 7e191fe2
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      Author: Gregory Owen <greowen@gmail.com>
      
      Closes #1935 from marmbrus/countDistinctPartial and squashes the following commits:
      
      5c7848d [Michael Armbrust] turn off caching in the constructor
      8074a80 [Michael Armbrust] fix tests
      32d216f [Michael Armbrust] reynolds comments
      c122cca [Michael Armbrust] Address comments, add tests
      b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
      fae38f4 [Michael Armbrust] Fix style
      fdca896 [Michael Armbrust] cleanup
      93d0f64 [Michael Armbrust] metastore concurrency fix.
      db44a30 [Michael Armbrust] JIT hax.
      3868f6c [Michael Armbrust] Merge pull request #9 from GregOwen/countDistinctPartial
      c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
      2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
      8ff6402 [Michael Armbrust] Add specific row.
      58d15f1 [Michael Armbrust] disable codegen logging
      87d101d [Michael Armbrust] Fix isNullAt bug
      abee26d [Michael Armbrust] WIP
      27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
      57ae3b1 [Michael Armbrust] Fix order dependent test
      b3d0f64 [Michael Armbrust] Add golden files.
      c1f7114 [Michael Armbrust] Improve tests / fix serialization.
      f31b8ad [Michael Armbrust] more fixes
      38c7449 [Michael Armbrust] comments and style
      9153652 [Michael Armbrust] better toString
      d494598 [Michael Armbrust] Fix tests now that the planner is better
      41fbd1d [Michael Armbrust] Never try and create an empty hash set.
      050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
      bd08239 [Michael Armbrust] WIP
      213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max
    • [SQL] Make functionRegistry in HiveContext transient. · 2fb1c72e
      Yin Huai authored
      Seems we missed `transient` for the `functionRegistry` in `HiveContext`.
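
      The pattern at work here, as a minimal illustrative sketch (class and field names are hypothetical, not HiveContext's actual code): marking a non-serializable member @transient keeps it out of the serialized object, and making it lazy lets it be rebuilt on first access after deserialization.

      ```scala
      // Illustrative sketch only.
      class ExampleContext extends Serializable {
        // Not shipped with the serialized object; recreated lazily on the
        // deserialized copy the first time it is accessed.
        @transient lazy val functionRegistry: scala.collection.mutable.Map[String, String] =
          scala.collection.mutable.Map.empty[String, String]
      }
      ```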
      
      cc: marmbrus
      
      Author: Yin Huai <huaiyin.thu@gmail.com>
      
      Closes #2074 from yhuai/makeFunctionRegistryTransient and squashes the following commits:
      
      6534e7d [Yin Huai] Make functionRegistry transient.
    • [Minor] fix typo · 76bb044b
      Liang-Chi Hsieh authored
      Fix a typo in comment.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #2105 from viirya/fix_typo and squashes the following commits:
      
      6596a80 [Liang-Chi Hsieh] fix typo.
    • [SPARK-3068]remove MaxPermSize option for jvm 1.8 · f3d65cd0
      Daoyuan Wang authored
      In JVM 1.8.0, MaxPermSize is no longer supported.
      In spark `stderr` output, there would be a line of
      
          Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
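
      The actual fix lives in the launch scripts, but here is an illustrative Scala sketch of the runtime check (the version parsing below assumes the pre-Java-9 "1.x" numbering scheme):

      ```scala
      // Hypothetical sketch: decide whether to pass MaxPermSize based on the
      // running JVM's version. PermGen (and this flag) was removed in Java 8.
      def permGenArgs: Seq[String] = {
        val version = System.getProperty("java.version") // e.g. "1.7.0_65", "1.8.0_11"
        val minor = version.split("\\.")(1).toInt
        if (minor >= 8) Seq.empty
        else Seq("-XX:MaxPermSize=128m")
      }
      ```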
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2011 from adrian-wang/maxpermsize and squashes the following commits:
      
      ef1d660 [Daoyuan Wang] direct get java version in runtime
      37db9c1 [Daoyuan Wang] code refine
      3c1d554 [Daoyuan Wang] remove MaxPermSize option for jvm 1.8
    • [SPARK-2963] REGRESSION - The description about how to build for using CLI and Thrift JDBC server is absent in proper document · 323cd92b
      Kousuke Saruta authored
      
      The most important things I mentioned in #1885 are as follows:

      * People who build Spark are not always programmers.
      * If the person building Spark is not a programmer, he/she won't read the programmer's guide before building.

      So instructions on how to build for using the CLI and the JDBC server should not live only in the programmer's guide.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2080 from sarutak/SPARK-2963 and squashes the following commits:
      
      ee07c76 [Kousuke Saruta] Modified regression of the description about building for using Thrift JDBC server and CLI
      ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
      07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
      6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
      c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
  3. Aug 22, 2014
    • [SPARK-3169] Removed dependency on spark streaming test from spark flume sink · 30040741
      Tathagata Das authored
      Due to maven bug https://jira.codehaus.org/browse/MNG-1378, maven could not resolve spark streaming classes required by the spark-streaming test-jar dependency of external/flume-sink. There is no particular reason that the external/flume-sink has to depend on Spark Streaming at all, so I am eliminating this dependency. Also I have removed the exclusions present in the Flume dependencies, as there is no reason to exclude them (they were excluded in the external/flume module to prevent dependency collisions with Spark).
      
      Since Jenkins will test the sbt build and run the unit tests, I only tested Maven compilation locally.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2101 from tdas/spark-sink-pom-fix and squashes the following commits:
      
      8f42621 [Tathagata Das] Added Flume sink exclusions back, and added netty to test dependencies
      93b559f [Tathagata Das] Removed dependency on spark streaming test from spark flume sink
    • a5219db1
      Reynold Xin authored
    • [SPARK-2742][yarn] delete useless variables · 220c2d76
      XuTingjun authored
      Author: XuTingjun <1039320815@qq.com>
      
      Closes #1614 from XuTingjun/yarn-bug and squashes the following commits:
      
      f07096e [XuTingjun] Update ClientArguments.scala
  4. Aug 21, 2014
    • [SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples) · 050f8d01
      Joseph K. Bradley authored
      Updated DecisionTree documentation, with examples for Java, Python.
      Added same Java example to code as well.
      CC: @mengxr  @manishamde @atalwalkar
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2063 from jkbradley/dt-docs and squashes the following commits:
      
      2dd2c19 [Joseph K. Bradley] Last updates based on github review.
      9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
      d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
      b9bee04 [Joseph K. Bradley] Updated DT examples
      57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
      d939a92 [Joseph K. Bradley] Updated DecisionTree documentation.  Added Java, Python examples.
  5. Aug 20, 2014
    • [SPARK-2843][MLLIB] add a section about regularization parameter in ALS · e0f94626
      Xiangrui Meng authored
      atalwalkar srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2064 from mengxr/als-doc and squashes the following commits:
      
      b2e20ab [Xiangrui Meng] introduced -> discussed
      98abdd7 [Xiangrui Meng] add reference
      339bd08 [Xiangrui Meng] add a section about regularization parameter in ALS
    • [SPARK-3143][MLLIB] add tf-idf user guide · e1571874
      Xiangrui Meng authored
      Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2061 from mengxr/tfidf-doc and squashes the following commits:
      
      ca04c70 [Xiangrui Meng] address comments
      a5ea4b4 [Xiangrui Meng] add tf-idf user guide
    • [SPARK-3140] Clarify confusing PySpark exception message · ba3c730e
      Andrew Or authored
      We read the py4j port from the stdout of the `bin/spark-submit` subprocess. If there is interference in stdout (e.g. a random echo in `spark-submit`), we throw an exception with a warning message. We do not, however, distinguish this case from the case where no stdout is produced at all.
      
      I wasted a non-trivial amount of time being baffled by this exception in search of places where I print random whitespace (in vain, of course). A clearer exception message that distinguishes between these cases will prevent similar headaches that I have gone through.
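
      As an illustrative sketch of the distinction being drawn (the real logic lives in PySpark's launcher code; Scala is used here only for illustration):

      ```scala
      import java.io.{BufferedReader, InputStreamReader}

      // Hypothetical sketch: distinguish "no output at all" from "unexpected
      // output" when reading the py4j port from the subprocess's stdout.
      def readGatewayPort(proc: Process): Int = {
        val reader = new BufferedReader(new InputStreamReader(proc.getInputStream))
        val line = reader.readLine()
        if (line == null) {
          sys.error("Launcher produced no output at all; did spark-submit fail to start?")
        } else {
          try line.trim.toInt
          catch {
            case _: NumberFormatException =>
              sys.error(s"Expected a py4j port number on stdout but got: '$line'")
          }
        }
      }
      ```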
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2067 from andrewor14/python-exception and squashes the following commits:
      
      742f823 [Andrew Or] Further clarify warning messages
      e96a7a0 [Andrew Or] Distinguish between unexpected output and no output at all
    • [SPARK-2848] Shade Guava in uber-jars. · c9f74395
      Marcelo Vanzin authored
      For further discussion, please check the JIRA entry.
      
      This change moves Guava classes to a different package so that they don't conflict with the user-provided Guava (or the Hadoop-provided one). Since one class (Optional) was exposed through Spark's public API, that class was forked from Guava at the current dependency version (14.0.1) so that it can be kept going forward (until the API is cleaned).
      
      Note this change has a few implications:
      - *all* classes in the final jars will reference the relocated classes. If Hadoop classes are included (i.e. "-Phadoop-provided" is not activated), those will also reference the Guava 14 classes (instead of the Guava 11 classes from the Hadoop classpath).
      - if the Guava version in Spark is ever changed, the new Guava will still reference the forked Optional class; this may or may not be a problem, but in the long term it's better to think about removing Optional from the public API.
      
      For the end user, there are two visible implications:
      
      - Guava is not provided as a transitive dependency anymore (since it's "provided" in Spark)
      - At runtime, unless they provide their own, they'll either have no Guava or Hadoop's version of Guava (11), depending on how they set up their classpath.
      
      Note that this patch does not change the sbt deliverables; those will still contain guava in its original package, and provide guava as a compile-time dependency. This assumes that maven is the canonical build, and sbt-built artifacts are not (officially) published.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #1813 from vanzin/SPARK-2848 and squashes the following commits:
      
      9bdffb0 [Marcelo Vanzin] Undo sbt build changes.
      819b445 [Marcelo Vanzin] Review feedback.
      05e0a3d [Marcelo Vanzin] Merge branch 'master' into SPARK-2848
      fef4370 [Marcelo Vanzin] Unfork Optional.java.
      d3ea8e1 [Marcelo Vanzin] Exclude asm classes from final jar.
      637189b [Marcelo Vanzin] Add hacky filter to prefer Spark's copy of Optional.
      2fec990 [Marcelo Vanzin] Shade Guava in the sbt build.
      616998e [Marcelo Vanzin] Shade Guava in the maven build, fork Guava's Optional.java.
    • [SPARK-2846][SQL] Add configureInputJobPropertiesForStorageHandler to initialization of job conf · d9e94146
      Alex Liu authored
      
      Author: Alex Liu <alex_liu68@yahoo.com>
      
      Closes #1927 from alexliu68/SPARK-SQL-2846 and squashes the following commits:
      
      e4bdc4c [Alex Liu] SPARK-SQL-2846 add configureInputJobPropertiesForStorageHandler to initial job conf
    • SPARK_LOGFILE and SPARK_ROOT_LOGGER no longer need in spark-daemon.sh · a1e8b1bc
      wangfei authored
      Author: wangfei <wangfei_hello@126.com>
      
      Closes #2057 from scwf/patch-7 and squashes the following commits:
      
      1b7b9a5 [wangfei] SPARK_LOGFILE and SPARK_ROOT_LOGGER no longer need in spark-daemon.sh
    • [SPARK-2967][SQL] Fix sort based shuffle for spark sql. · a2e658dc
      Michael Armbrust authored
      Add explicit row copies when sort based shuffle is on.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2066 from marmbrus/sortShuffle and squashes the following commits:
      
      fcd7bb2 [Michael Armbrust] Fix sort based shuffle for spark sql.
    • [SPARK-2298] Encode stage attempt in SparkListener & UI. · fb60bec3
      Reynold Xin authored
      Simple way to reproduce this in the UI:
      
      ```scala
      val f = new java.io.File("/tmp/test")
      f.delete()
      sc.parallelize(1 to 2, 2).map(x => (x, x)).repartition(3).mapPartitionsWithContext { case (context, iter) =>
        if (context.partitionId == 0) {
          val f = new java.io.File("/tmp/test")
          if (!f.exists) {
            f.mkdir()
            System.exit(0);
          }
        }
        iter
      }.count()
      ```
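
      Running the snippet kills an executor on the first attempt, so the stage is retried and the new attempt id becomes visible through the listener API. A minimal sketch of reading it (assuming the field added here is exposed as StageInfo.attemptId):

      ```scala
      import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

      // Logs the attempt number of every submitted stage.
      class StageAttemptLogger extends SparkListener {
        override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
          val info = stageSubmitted.stageInfo
          println(s"stage ${info.stageId} attempt ${info.attemptId} submitted")
        }
      }
      // Register with: sc.addSparkListener(new StageAttemptLogger)
      ```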
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1545 from rxin/stage-attempt and squashes the following commits:
      
      3ee1d2a [Reynold Xin] - Rename attempt to retry in UI. - Properly report stage failure in FetchFailed.
      40a6bd5 [Reynold Xin] Updated test suites.
      c414c36 [Reynold Xin] Fixed the hanging in JobCancellationSuite.
      b3e2eed [Reynold Xin] Oops previous code didn't compile.
      0f36075 [Reynold Xin] Mark unknown stage attempt with id -1 and drop that in JobProgressListener.
      6c08b07 [Reynold Xin] Addressed code review feedback.
      4e5faa2 [Reynold Xin] [SPARK-2298] Encode stage attempt in SparkListener & UI.
    • [SPARK-2849] Handle driver configs separately in client mode · b3ec51bf
      Andrew Or authored
      In client deploy mode, the driver is launched from within `SparkSubmit`'s JVM. This means by the time we parse Spark configs from `spark-defaults.conf`, it is already too late to control certain properties of the driver's JVM. We currently ignore these configs in client mode altogether.
      ```
      spark.driver.memory
      spark.driver.extraJavaOptions
      spark.driver.extraClassPath
      spark.driver.extraLibraryPath
      ```
      This PR handles these properties before launching the driver JVM. It achieves this by spawning a separate JVM that runs a new class called `SparkSubmitDriverBootstrapper`, which spawns `SparkSubmit` as a sub-process with the appropriate classpath, library paths, java opts and memory.
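
      As a hypothetical sketch of the bootstrapping idea (not the actual SparkSubmitDriverBootstrapper):

      ```scala
      import scala.collection.JavaConverters._

      // Spawn a fresh JVM whose memory, classpath and java opts reflect the
      // driver settings from spark-defaults.conf, then run SparkSubmit inside it.
      def launchDriverJvm(driverMemory: String, classPath: String,
                          javaOpts: Seq[String], submitArgs: Seq[String]): Process = {
        val javaBin = System.getProperty("java.home") + "/bin/java"
        val command = Seq(javaBin, s"-Xmx$driverMemory", "-cp", classPath) ++
          javaOpts ++ Seq("org.apache.spark.deploy.SparkSubmit") ++ submitArgs
        new ProcessBuilder(command.asJava).inheritIO().start()
      }
      ```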
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1845 from andrewor14/handle-configs-bash and squashes the following commits:
      
      bed4bdf [Andrew Or] Change a few comments / messages (minor)
      24dba60 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      08fd788 [Andrew Or] Warn against external usages of SparkSubmitDriverBootstrapper
      ff34728 [Andrew Or] Minor comments
      51aeb01 [Andrew Or] Filter out JVM memory in Scala rather than Bash (minor)
      9a778f6 [Andrew Or] Fix PySpark: actually kill driver on termination
      d0f20db [Andrew Or] Don't pass empty library paths, classpath, java opts etc.
      a78cb26 [Andrew Or] Revert a few changes in utils.sh (minor)
      9ba37e2 [Andrew Or] Don't barf when the properties file does not exist
      8867a09 [Andrew Or] A few more naming things (minor)
      19464ad [Andrew Or] SPARK_SUBMIT_JAVA_OPTS -> SPARK_SUBMIT_OPTS
      d6488f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      1ea6bbe [Andrew Or] SparkClassLauncher -> SparkSubmitDriverBootstrapper
      a91ea19 [Andrew Or] Fix precedence of library paths, classpath, java opts and memory
      158f813 [Andrew Or] Remove "client mode" boolean argument
      c84f5c8 [Andrew Or] Remove debug print statement (minor)
      b71f52b [Andrew Or] Revert a few more changes (minor)
      7d94a8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      3a8235d [Andrew Or] Only parse the properties file if special configs exist
      c37e08d [Andrew Or] Revert a few more changes
      a396eda [Andrew Or] Nullify my own hard work to simplify bash
      0effa1e [Andrew Or] Add code in Scala that handles special configs
      c886568 [Andrew Or] Fix lines too long + a few comments / style (minor)
      7a4190a [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      7396be2 [Andrew Or] Explicitly comment that multi-line properties are not supported
      fa11ef8 [Andrew Or] Parse the properties file only if the special configs exist
      371cac4 [Andrew Or] Add function prefix (minor)
      be99eb3 [Andrew Or] Fix tests to not include multi-line configs
      bd0d468 [Andrew Or] Simplify parsing config file by ignoring multi-line arguments
      56ac247 [Andrew Or] Use eval and set to simplify splitting
      8d4614c [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      aeb79c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
      2732ac0 [Andrew Or] Integrate BASH tests into dev/run-tests + log error properly
      8d26a5c [Andrew Or] Add tests for bash/utils.sh
      4ae24c3 [Andrew Or] Fix bug: escape properly in quote_java_property
      b3c4cd5 [Andrew Or] Fix bug: count the number of quotes instead of detecting presence
      c2273fc [Andrew Or] Fix typo (minor)
      e793e5f [Andrew Or] Handle multi-line arguments
      5d8f8c4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      c7b9926 [Andrew Or] Minor changes to spark-defaults.conf.template
      a992ae2 [Andrew Or] Escape spark.*.extraJavaOptions correctly
      aabfc7e [Andrew Or] escape -> split (minor)
      45a1eb9 [Andrew Or] Fix bug: escape escaped backslashes and quotes properly...
      1cdc6b1 [Andrew Or] Fix bug: escape escaped double quotes properly
      c854859 [Andrew Or] Add small comment
      c13a2cb [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      8e552b7 [Andrew Or] Include an example of spark.*.extraJavaOptions
      de765c9 [Andrew Or] Print spark-class command properly
      a4df3c4 [Andrew Or] Move parsing and escaping logic to utils.sh
      dec2343 [Andrew Or] Only export variables if they exist
      fa2136e [Andrew Or] Escape Java options + parse java properties files properly
      ef12f74 [Andrew Or] Minor formatting
      4ec22a1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      e5cfb46 [Andrew Or] Collapse duplicate code + fix potential whitespace issues
      4edcaa8 [Andrew Or] Redirect stdout to stderr for python
      130f295 [Andrew Or] Handle spark.driver.memory too
      98dd8e3 [Andrew Or] Add warning if properties file does not exist
      8843562 [Andrew Or] Fix compilation issues...
      75ee6b4 [Andrew Or] Remove accidentally added file
      63ed2e9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
      0025474 [Andrew Or] Revert SparkSubmit handling of --driver-* options for only cluster mode
      a2ab1b0 [Andrew Or] Parse spark.driver.extra* in bash
      250cb95 [Andrew Or] Do not ignore spark.driver.extra* for client mode
    • [SPARK-3149] Connection establishment information is not enough. · c1ba4cd6
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2060 from sarutak/SPARK-3149 and squashes the following commits:
      
      1cc89af [Kousuke Saruta] Modified log message of accepting connection
    • [SPARK-3062] [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled · 0ea46ac8
      Kousuke Saruta authored
      #1891 was to avoid IOException when EventLogging is enabled.
      The solution used ShutdownHookManager, but it is defined only in Hadoop 2.x; Hadoop 1.x doesn't have ShutdownHookManager, so #1891 doesn't compile on Hadoop 1.x.

      Now I have a compromise solution that works on both Hadoop 1.x and 2.x:
      a unique FileSystem object is created just for FileLogger.
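
      One common way to get such a non-shared instance is to disable Hadoop's FileSystem cache for the scheme; a hypothetical sketch of the idea (not the actual FileLogger change):

      ```scala
      import java.net.URI
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.FileSystem

      // With the cache disabled for this scheme, the returned FileSystem is
      // private to the caller and is not closed behind our back when the
      // shared, cached instances are shut down.
      def uniqueFileSystem(logDir: String): FileSystem = {
        val uri = new URI(logDir)
        val conf = new Configuration()
        conf.setBoolean(s"fs.${uri.getScheme}.impl.disable.cache", true)
        FileSystem.get(uri, conf)
      }
      ```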
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1970 from sarutak/SPARK-2970 and squashes the following commits:
      
      240c91e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2970
      0e7b45d [Kousuke Saruta] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
      e1262ec [Kousuke Saruta] Modified Filelogger to use unique FileSystem instance
    • [SPARK-3126][SPARK-3127][SQL] Fixed HiveThriftServer2Suite · cf46e725
      Cheng Lian authored
      This PR fixes two issues:
      
      1. Fixes a wrongly quoted command line option in `HiveThriftServer2Suite` that makes test cases hang until timeout.
      2. Asks `dev/run-tests` to run Spark SQL tests when `bin/spark-sql` and/or `sbin/start-thriftserver.sh` are modified.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2036 from liancheng/fix-thriftserver-test and squashes the following commits:
      
      f38c4eb [Cheng Lian] Fixed the same quotation issue in CliSuite
      26b82a0 [Cheng Lian] Run SQL tests when dff contains bin/spark-sql and/or sbin/start-thriftserver.sh
      a87f83d [Cheng Lian] Extended timeout
      e5aa31a [Cheng Lian] Fixed metastore JDBC URI quotation
    • BUILD: Bump Hadoop versions in the release build. · ceb19830
      Patrick Wendell authored
      Also, minor modifications to the MapR profile.
    • SPARK-3092 [SQL]: Always include the thriftserver when -Phive is enabled. · f2f26c2a
      Patrick Wendell authored
      Currently we have a separate profile called hive-thriftserver. I originally suggested this in case users did not want to bundle the thriftserver, but it's ultimately led to a lot of confusion. Since the thriftserver is only a few classes, I don't see a really good reason to isolate it from the rest of Hive. So let's go ahead and just include it in the same profile to simplify things.
      
      This has been suggested in the past by liancheng.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2006 from pwendell/hiveserver and squashes the following commits:
      
      742ea40 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into hiveserver
      034ad47 [Patrick Wendell] SPARK-3092: Always include the thriftserver when -Phive is enabled.
    • [SPARK-3054][STREAMING] Add unit tests for Spark Sink. · 8c5a2226
      Hari Shreedharan authored
      This patch adds unit tests for Spark Sink.
      
      It also removes the private[flume] on Spark Sink, since the sink is instantiated from Flume configuration (it looks like this is ignored by the reflection Flume uses, but we should still remove it anyway).
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      Author: Hari Shreedharan <hshreedharan@cloudera.com>
      
      Closes #1958 from harishreedharan/spark-sink-test and squashes the following commits:
      
      e3110b9 [Hari Shreedharan] Add a sleep to allow sink to commit the transactions
      120b81e [Hari Shreedharan] Fix complexity in threading model in test
      4df5be6 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into spark-sink-test
      c9190d1 [Hari Shreedharan] Indentation and spaces changes
      7fedc5a [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into spark-sink-test
      abc20cb [Hari Shreedharan] Minor test changes
      7b9b649 [Hari Shreedharan] Merge branch 'master' into spark-sink-test
      f2c56c9 [Hari Shreedharan] Update SparkSinkSuite.scala
      a24aac8 [Hari Shreedharan] Remove unused var
      c86d615 [Hari Shreedharan] [SPARK-3054][STREAMING] Add unit tests for Spark Sink.
    • [SPARK-3141] [PySpark] fix sortByKey() with take() · 0a7ef633
      Davies Liu authored
      Fix sortByKey() with take()
      
      The function `f` used in mapPartitions should always return an iterator.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2045 from davies/fix_sortbykey and squashes the following commits:
      
      1160f59 [Davies Liu] fix sortByKey() with take()
    • [DOCS] Fixed wrong links · 8a74e4b2
      Ken Takagiwa authored
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      
      Closes #2042 from giwa/patch-1 and squashes the following commits:
      
      216fe0e [Ken Takagiwa] Fixed wrong links
    • [SPARK-2974] [SPARK-2975] Fix two bugs related to spark.local.dirs · ebcb94f7
      Josh Rosen authored
      This PR fixes two bugs related to `spark.local.dirs` and `SPARK_LOCAL_DIRS`, one where `Utils.getLocalDir()` might return an invalid directory (SPARK-2974) and another where the `SPARK_LOCAL_DIRS` override didn't affect the driver, which could cause problems when running tasks in local mode (SPARK-2975).
      
      This patch fixes both issues: the new `Utils.getOrCreateLocalRootDirs(conf: SparkConf)` utility method manages the creation of local directories and handles the precedence among the different configuration options, so we should see the same behavior whether we're running in local mode or on a worker.
      
      It's kind of a pain to mock out environment variables in tests (no easy way to mock System.getenv), so I added a `private[spark]` method to SparkConf for accessing environment variables (by default, it just delegates to System.getenv).  By subclassing SparkConf and overriding this method, we can mock out SPARK_LOCAL_DIRS in tests.
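
      A minimal sketch of that testing hook, assuming the new method is the private[spark] getenv described above (so the subclass must live under an org.apache.spark package):

      ```scala
      import org.apache.spark.SparkConf

      // Hypothetical test helper: answer environment lookups from a map instead
      // of the real System.getenv, so SPARK_LOCAL_DIRS can be faked in tests.
      class MockEnvSparkConf(env: Map[String, String]) extends SparkConf(false) {
        override private[spark] def getenv(name: String): String =
          env.getOrElse(name, super.getenv(name))
      }

      // e.g. new MockEnvSparkConf(Map("SPARK_LOCAL_DIRS" -> "/tmp/test-local"))
      ```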
      
      I also fixed a typo in PySpark where we used `SPARK_LOCAL_DIR` instead of `SPARK_LOCAL_DIRS` (I think this was technically innocuous, but it seemed worth fixing).
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2002 from JoshRosen/local-dirs and squashes the following commits:
      
      efad8c6 [Josh Rosen] Address review comments:
      1dec709 [Josh Rosen] Minor updates to Javadocs.
      7f36999 [Josh Rosen] Use env vars to detect if running in YARN container.
      399ac25 [Josh Rosen] Update getLocalDir() documentation.
      bb3ad89 [Josh Rosen] Remove duplicated YARN getLocalDirs() code.
      3e92d44 [Josh Rosen] Move local dirs override logic into Utils; fix bugs:
      b2c4736 [Josh Rosen] Add failing tests for SPARK-2974 and SPARK-2975.
      007298b [Josh Rosen] Allow environment variables to be mocked in tests.
      6d9259b [Josh Rosen] Fix typo in PySpark: SPARK_LOCAL_DIR should be SPARK_LOCAL_DIRS
    • [SPARK-3142][MLLIB] output shuffle data directly in Word2Vec · 0a984aa1
      Xiangrui Meng authored
      Sorry I didn't realize this in #2043. Ishiihara
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2049 from mengxr/more-w2v and squashes the following commits:
      
      050b1c5 [Xiangrui Meng] output shuffle data directly
    • [SPARK-3119] Re-implementation of TorrentBroadcast. · 8adfbc2b
      Reynold Xin authored
      This is a re-implementation of TorrentBroadcast, with the following changes:
      
      1. Removes most of the mutable, transient state from TorrentBroadcast (e.g. totalBytes, num of blocks fetched).
      2. Removes TorrentInfo and TorrentBlock
      3. Replaces the BlockManager.getSingle call in readObject with a getLocal, resulting in one less RPC call to the BlockManagerMasterActor to find the location of the block.
      4. Removes the metadata block, resulting in one less block to fetch.
      5. Removes an extra memory copy for deserialization (by using Java's SequenceInputStream).
      
      Basically, for a regular broadcast object with only one block, the number of RPC calls goes from 5+1 to 2+1.
      
      Old TorrentBroadcast for object of a single block:
      1 RPC to ask for location of the broadcast variable
      1 RPC to ask for location of the metadata block
      1 RPC to fetch the metadata block
      1 RPC to ask for location of the first data block
      1 RPC to fetch the first data block
      1 RPC to tell the driver we put the first data block in
      i.e. 5 + 1
      
      New TorrentBroadcast for object of a single block:
      1 RPC to ask for location of the first data block
      1 RPC to get the first data block
      1 RPC to tell the driver we put the first data block in
      i.e. 2 + 1
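
      Point 5 above is worth a sketch: instead of copying all fetched blocks into one big buffer before deserializing, the block streams can be chained. A minimal illustrative version (not the actual TorrentBroadcast code):

      ```scala
      import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, SequenceInputStream}
      import scala.collection.JavaConverters._

      // Deserialize directly across block boundaries; no combined-buffer copy.
      def deserializeBlocks[T](blocks: Seq[Array[Byte]]): T = {
        val streams = blocks.iterator.map(b => new ByteArrayInputStream(b): InputStream)
        val in = new ObjectInputStream(new SequenceInputStream(streams.asJavaEnumeration))
        try in.readObject().asInstanceOf[T] finally in.close()
      }
      ```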
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2030 from rxin/torrentBroadcast and squashes the following commits:
      
      5bacb9d [Reynold Xin] Always add the object to driver's block manager.
      0d8ed5b [Reynold Xin] Added getBytes to BlockManager and uses that in TorrentBroadcast.
      2d6a5fb [Reynold Xin] Use putBytes/getRemoteBytes throughout.
      3670f00 [Reynold Xin] Code review feedback.
      c1185cd [Reynold Xin] [SPARK-3119] Re-implementation of TorrentBroadcast.
    • [HOTFIX][Streaming][MLlib] use temp folder for checkpoint · fce5c0fb
      Xiangrui Meng authored
      or Jenkins will complain about no Apache header in checkpoint files. tdas rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2046 from mengxr/tmp-checkpoint and squashes the following commits:
      
      0d3ec73 [Xiangrui Meng] remove ssc.stop
      9797843 [Xiangrui Meng] change checkpointDir to lazy val
      89964ab [Xiangrui Meng] use temp folder for checkpoint
  6. Aug 19, 2014
    • [SPARK-3130][MLLIB] detect negative values in naive Bayes · 068b6fe6
      Xiangrui Meng authored
      because NB treats feature values as term frequencies. jkbradley
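
      A minimal sketch of the kind of check this adds (the real validation happens inside MLlib's NaiveBayes training; names here are illustrative):

      ```scala
      // Term-frequency semantics make negative feature values meaningless,
      // so fail fast with a clear message instead of producing garbage.
      def requireNonnegative(features: Array[Double]): Unit =
        require(features.forall(_ >= 0.0),
          "Naive Bayes requires nonnegative feature values but found " +
            features.mkString("[", ", ", "]"))
      ```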
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2038 from mengxr/nb-neg and squashes the following commits:
      
      52c37c3 [Xiangrui Meng] address comments
      65f892d [Xiangrui Meng] detect negative values in nb
    • [SQL] add note of use synchronizedMap in SQLConf · 0e3ab94d
      wangfei authored
      Refer to:
      http://stackoverflow.com/questions/510632/whats-the-difference-between-concurrenthashmap-and-collections-synchronizedmap
      Collections.synchronizedMap(map) creates a blocking Map, which will degrade performance, albeit ensuring consistency. So use ConcurrentHashMap (a more efficient thread-safe map) instead.

      Also updates HiveQuerySuite to fix a test error caused by the change to ConcurrentHashMap.
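
      As an illustrative sketch of the swap (field names are hypothetical, not SQLConf's actual ones):

      ```scala
      import java.util.concurrent.ConcurrentHashMap

      // ConcurrentHashMap permits concurrent reads and striped writes, while
      // Collections.synchronizedMap serializes every access through one lock.
      val settings = new ConcurrentHashMap[String, String]()
      settings.put("spark.sql.shuffle.partitions", "200")
      val partitions = Option(settings.get("spark.sql.shuffle.partitions")).getOrElse("200")
      ```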
      
      Author: wangfei <wangfei_hello@126.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #1996 from scwf/sqlconf and squashes the following commits:
      
      93bc0c5 [wangfei] revert change of HiveQuerySuite
      0cc05dd [wangfei] add note for use synchronizedMap
      3c224d31 [scwf] fix formate
      a7bcb98 [scwf] use ConcurrentHashMap in sql conf, intead synchronizedMap
    • [SPARK-3112][MLLIB] Add documentation and example for StreamingLR · c7252b00
      freeman authored
      Added a documentation section on StreamingLR to the ``MLlib - Linear Methods``, including a worked example.
      
      mengxr tdas
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #2047 from freeman-lab/streaming-lr-docs and squashes the following commits:
      
      568d250 [freeman] Tweaks to wording / formatting
      05a1139 [freeman] Added documentation and example for StreamingLR
    • [MLLIB] minor update to word2vec · 1870dbaa
      Xiangrui Meng authored
      very minor update Ishiihara
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2043 from mengxr/minor-w2v and squashes the following commits:
      
      be649fd [Xiangrui Meng] remove map because we only need append
      eccefcc [Xiangrui Meng] minor updates to word2vec
    • [SPARK-2468] Netty based block server / client module · 8b9dc991
      Reynold Xin authored
      Previous pull request (#1907) was reverted. This brings it back. Still looking into the hang.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1971 from rxin/netty1 and squashes the following commits:
      
      b0be96f [Reynold Xin] Added test to make sure outstandingRequests are cleaned after firing the events.
      4c6d0ee [Reynold Xin] Pass callbacks cleanly.
      603dce7 [Reynold Xin] Upgrade Netty to 4.0.23 to fix the DefaultFileRegion bug.
      88be1d4 [Reynold Xin] Downgrade to 4.0.21 to work around a bug in writing DefaultFileRegion.
      002626a [Reynold Xin] Remove netty-test-file.txt.
      db6e6e0 [Reynold Xin] Revert "Revert "[SPARK-2468] Netty based block server / client module""