  1. Jul 12, 2015
    • Kay Ousterhout's avatar
      [SPARK-8880] Fix confusing Stage.attemptId member variable · 30090884
      Kay Ousterhout authored
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #7275 from kayousterhout/SPARK-8880 and squashes the following commits:
      
      3e9ce7c [Kay Ousterhout] Added missing return type
      e150278 [Kay Ousterhout] [SPARK-8880] Fix confusing Stage.attemptId member variable
      30090884
  2. Jul 11, 2015
  3. Jul 10, 2015
    • Joseph K. Bradley's avatar
      [SPARK-8994] [ML] tiny cleanups to Params, Pipeline · 0c5207c6
      Joseph K. Bradley authored
      Made default impl of Params.validateParams empty
      CC mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7349 from jkbradley/pipeline-small-cleanups and squashes the following commits:
      
      4e0f013 [Joseph K. Bradley] small cleanups after SPARK-5956
      0c5207c6
    • zhangjiajin's avatar
      [SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib · 7f6be1f2
      zhangjiajin authored
      Add parallel PrefixSpan algorithm and test file.
      Support non-temporal sequences.
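
A hedged usage sketch (assuming an existing SparkContext `sc`; method and field names follow the eventual spark.mllib API shape, which later commits refined, so the exact API in this initial commit may differ):

```scala
import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.rdd.RDD

// Each sequence is an array of itemsets; each itemset is an array of items.
val sequences: RDD[Array[Array[Int]]] = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2))
))

val model = new PrefixSpan()
  .setMinSupport(0.5)      // a pattern must occur in >= 50% of sequences
  .setMaxPatternLength(5)  // cap the length of mined patterns
  .run(sequences)

model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("(", ",", ")")).mkString + " -> " + fs.freq)
}
```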
      
      Author: zhangjiajin <zhangjiajin@huawei.com>
      Author: zhang jiajin <zhangjiajin@huawei.com>
      
      Closes #7258 from zhangjiajin/master and squashes the following commits:
      
      ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
      574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
      ba5df34 [zhangjiajin] Fix a Scala style error.
      4c60fb3 [zhangjiajin] Fix some Scala style errors.
      1dd33ad [zhangjiajin] Modified the code according to the review comments.
      89bc368 [zhangjiajin] Fixed a Scala style error.
      a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
      951fd42 [zhang jiajin] Delete Prefixspan.scala
      575995f [zhangjiajin] Modified the code according to the review comments.
      91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
      7f6be1f2
    • jose.cambronero's avatar
      [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs · 9c507577
      jose.cambronero authored
This contribution is my original work and I license it to the project under its open source license.
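
As a usage illustration (hedged; based on the public method name and the standard-normal default settled on in the commits below, with made-up sample data), the test can be run roughly like this:

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))

// One-sample, two-sided KS test against N(0, 1); "norm" is the only named
// distribution the commits below mention for the convenience method.
val result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
println(result.statistic)  // KS statistic: max distance between ECDF and CDF
println(result.pValue)
```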
      
      Author: jose.cambronero <jose.cambronero@cloudera.com>
      
      Closes #6994 from josepablocam/master and squashes the following commits:
      
      bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
      0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
      1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
      a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
      1bb44bd [jose.cambronero]  style and doc changes. Factored out ks test into 2 separate tests
      2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
      a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
      7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
      e760ebd [jose.cambronero] line length changes to fit style check
      3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
      9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part
      9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
      3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
      992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
      6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
      4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
      0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
      16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
      c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
      f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
      b9cff3a [jose.cambronero] made small changes to pass style check
      ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
      4da189b [jose.cambronero] added user facing ks test functions
      c659ea1 [jose.cambronero] created KS test class
      13dfe4d [jose.cambronero] created test result class for ks test
      9c507577
    • Scott Taylor's avatar
      [SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe commands · 6e1c7e27
      Scott Taylor authored
      This will allow problems with piped commands to be detected.
      This will also allow tasks to be retried where errors are rare (such as network problems in piped commands).
      
      Author: Scott Taylor <github@megatron.me.uk>
      
      Closes #6262 from megatron-me-uk/patch-2 and squashes the following commits:
      
      04ae1d5 [Scott Taylor] Remove spurious empty line
      98fa101 [Scott Taylor] fix blank line style error
      574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4
      0c1e762 [Scott Taylor] Update rdd pipe method for checkCode
      ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode
      eb4801c [Scott Taylor] fix fail_condition
      b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1
      a307d13 [Scott Taylor] update rdd tests to test pipe modes
      34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe
      a0c0161 [Scott Taylor] fix generator issue
      8a9ef9c [Scott Taylor] make check_return_code an iterator
      0486ae3 [Scott Taylor] style fixes
      8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock
      4153b02 [Scott Taylor] fix list.sort returns None
      491d3fc [Scott Taylor] Pass a function handle to assertRaises
      3344a21 [Scott Taylor] wrap assertRaises with QuietTest
      3ab8c7a [Scott Taylor] remove whitespace for style
      cc1a73d [Scott Taylor] fix style issues in pipe test
      8db4073 [Scott Taylor] Add a test for rdd pipe functions
      1b3dc4e [Scott Taylor] fix missing space around operator style
      0974f98 [Scott Taylor] add space between words in multiline string
      45f4977 [Scott Taylor] fix line too long style error
      5745d85 [Scott Taylor] Remove space to fix style
      f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
      6e1c7e27
    • Cheng Lian's avatar
      [SPARK-8961] [SQL] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row · 33630883
      Cheng Lian authored
      This is a follow-up of [SPARK-8888] [1], which also aims to optimize writing dynamic partitions.
      
      Three more changes can be made here:
      
      1. Using `InternalRow` instead of `Row` in `BaseWriterContainer.outputWriterForRow`
      2. Using `Cast` expressions to convert partition columns to strings, so that we can leverage code generation.
3. Replacing the FP-style `zip` and `map` calls with a faster imperative `while` loop (see the sketch below).
      
      [1]: https://issues.apache.org/jira/browse/SPARK-8888
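
The third change is a common Scala hot-path micro-optimization; a generic sketch (illustrative, not the actual `BaseWriterContainer` code) of what such a rewrite looks like:

```scala
// FP style: allocates an intermediate tuple array and a closure call per element.
def partitionPathFp(names: Array[String], values: Array[String]): String =
  names.zip(values).map { case (n, v) => s"$n=$v" }.mkString("/")

// Imperative style: a single pass with no intermediate collections.
def partitionPathImperative(names: Array[String], values: Array[String]): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < names.length) {
    if (i > 0) sb.append('/')
    sb.append(names(i)).append('=').append(values(i))
    i += 1
  }
  sb.toString
}
```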
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7331 from liancheng/spark-8961 and squashes the following commits:
      
      b5ab9ae [Cheng Lian] Casts Java iterator to Scala iterator explicitly
      719e63b [Cheng Lian] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row
      33630883
    • Davies Liu's avatar
      add inline comment for python tests · b6fc0adf
      Davies Liu authored
      b6fc0adf
    • Cheng Lian's avatar
      [SPARK-8990] [SQL] SPARK-8990 DataFrameReader.parquet() should respect user specified options · 857e325f
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7347 from liancheng/spark-8990 and squashes the following commits:
      
      045698c [Cheng Lian] SPARK-8990 DataFrameReader.parquet() should respect user specified options
      857e325f
    • Josh Rosen's avatar
      [SPARK-7078] [SPARK-7079] Binary processing sort for Spark SQL · fb8807c9
      Josh Rosen authored
      This patch adds a cache-friendly external sorter which operates on serialized bytes and uses this sorter to implement a new sort operator for Spark SQL and DataFrames.
      
      ### Overview of the new sorter
      
      The new sorter design is inspired by [Alphasort](http://research.microsoft.com/pubs/68249/alphasort.doc) and implements a key-prefix optimization in order to improve the cache friendliness of the sort.  In naive sort implementations, the sorting algorithm operates on an array of record pointers.  To compare two records for ordering, the sorter must dereference these pointers, which likely involves random memory access, then compare the objects themselves.
      
      ![image](https://cloud.githubusercontent.com/assets/50748/8611390/3b1402ae-2675-11e5-8308-1a10bf347e6e.png)
      
In a key-prefix sort, the sort operates on an array which stores the record pointer alongside a prefix of the record's key. When comparing two records for ordering, the sorter first compares the stored key prefixes. If the ordering can be determined from the key prefixes (i.e. the prefixes are unequal), then the sort can avoid directly comparing the records, avoiding random memory accesses and full record comparisons. For example, if we're sorting a list of strings then we can store the first 8 bytes of the UTF-8 encoded string as the key-prefix and can perform unsigned byte-at-a-time comparisons to determine the ordering of strings based on their prefixes, only resorting to full comparisons for strings that share a common prefix.  In cases where the sort key can fit entirely in the space allotted for the key prefix (e.g. the sorting key is an integer), we completely avoid direct record comparison.
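
A minimal sketch of the string case just described (an assumed illustration, not the actual `PrefixComparators` code): pack the first 8 UTF-8 bytes into a big-endian `Long`, then compare the longs as unsigned values.

```scala
// Pack up to the first 8 bytes of the UTF-8 encoding into a Long, zero-padded
// on the right so shorter strings sort before their extensions.
def stringPrefix(s: String): Long = {
  val bytes = s.getBytes("UTF-8")
  var prefix = 0L
  var i = 0
  while (i < 8) {
    prefix = prefix << 8
    if (i < bytes.length) prefix |= (bytes(i) & 0xffL)
    i += 1
  }
  prefix
}

// Unsigned 64-bit comparison via the sign-flip trick; a result of 0 means the
// prefixes tie and the sorter must fall back to comparing the full records.
def comparePrefixes(a: Long, b: Long): Int =
  java.lang.Long.compare(a ^ Long.MinValue, b ^ Long.MinValue)
```

For instance, `comparePrefixes(stringPrefix("apple"), stringPrefix("banana"))` is negative, so those two records never need to be dereferenced.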
      
      In this patch's implementation of key-prefix sorting, our sorter's internal array stores a 64-bit long and 64-bit pointer for each record being sorted. The key prefixes are generated by the user when inserting records into the sorter, which uses a user-defined comparison function for comparing them.  The `PrefixComparators` object implements a set of comparators for many common types, including primitive numeric types and UTF-8 strings.
      
      The actual sorting is implemented by `UnsafeInMemorySorter`.  Most consumers will not use this directly, but instead will use `UnsafeExternalSorter`, a class which implements a sort that can spill to disk in response to memory pressure.  Internally, `UnsafeExternalSorter` creates `UnsafeInMemorySorters` to perform sorting and uses `UnsafeSortSpillReader/Writer` to spill and read back runs of sorted records and `UnsafeSortSpillMerger` to merge multiple sorted spills into a single sorted iterator.  This external sorter integrates with Spark's existing ShuffleMemoryManager for controlling spilling.
      
      Many parts of this sorter's design are based on / copied from the more specialized external sort implementation that I designed for the new UnsafeShuffleManager write path; see #5868 for more details on that patch.
      
      ### Sorting rows in Spark SQL
      
      For now, `UnsafeExternalSorter` is only used by Spark SQL, which uses it to implement a new sort operator, `UnsafeExternalSort`.  This sort operator uses a SQL-specific class called `UnsafeExternalRowSorter` that configures an `UnsafeExternalSorter` to use prefix generators and comparators that operate on rows encoded in the UnsafeRow format that was designed for Project Tungsten.
      
      I used some interesting unit-testing techniques to test this patch's SQL-specific components.  `UnsafeExternalSortSuite` uses the SQL random data generators introduced in #7176 to test the UnsafeSort operator with all atomic types both with and without nullability and in both ascending and descending sort orders.  `PrefixComparatorsSuite` contains a cool use of ScalaCheck + ScalaTest's `GeneratorDrivenPropertyChecks` in order to test UTF8String prefix comparison.
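
The property being checked can be sketched like this (hedged; the real `PrefixComparatorsSuite` may phrase it differently, and this reuses the `stringPrefix`/`comparePrefixes` helpers from the sketch in the overview above): whenever the prefix comparison is decisive, it must agree with a full byte-wise unsigned comparison.

```scala
import org.scalacheck.Prop.forAll

// Ground truth: unsigned byte-wise comparison of the full UTF-8 encodings.
def fullCompare(s1: String, s2: String): Int = {
  val a = s1.getBytes("UTF-8"); val b = s2.getBytes("UTF-8")
  var i = 0
  while (i < a.length && i < b.length) {
    val d = (a(i) & 0xff) - (b(i) & 0xff)
    if (d != 0) return d
    i += 1
  }
  a.length - b.length
}

val prefixAgreesWithFullOrder = forAll { (s1: String, s2: String) =>
  val p = comparePrefixes(stringPrefix(s1), stringPrefix(s2))
  p == 0 || math.signum(p) == math.signum(fullCompare(s1, s2))
}
```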
      
      ### Misc. additional improvements made in this patch
      
      This patch made several miscellaneous improvements to related code in Spark SQL:
      
- The logic for selecting physical sort operator implementations, which was partially duplicated in both `Exchange` and `SparkStrategies`, has now been consolidated into a `getSortOperator()` helper function in `SparkStrategies`.
      - The `SparkPlanTest` unit testing helper trait has been extended with new methods for comparing the output produced by two different physical plans. This makes it easy to write tests which assert that two physical operator implementations should produce the same output.  I also added a method for disabling the implicit sorting of outputs prior to comparing them, a change which is necessary in order to be able to write proper SparkPlan tests for sort operators.
      
      ### Tasks deferred to followup patches
      
      While most of this patch's features are reasonably well-tested and complete, there are a number of tasks that are intentionally being deferred to followup patches:
      
      - Add tests which mock the ShuffleMemoryManager to check that memory pressure properly triggers spilling (there are examples of this type of test in #5868).
      - Add tests to ensure that spill files are properly cleaned up after errors.  I'd like to do this in the context of a patch which introduces more general metrics for ensuring proper cleanup of tasks' temporary files; see https://issues.apache.org/jira/browse/SPARK-8966 for more details.
      - Metrics integration: there are some open questions regarding how to track / report spill metrics for non-shuffle operations, so I've deferred most of the IO / shuffle metrics integration for now.
      - Performance profiling.
      
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6444 from JoshRosen/sql-external-sort and squashes the following commits:
      
      6beb467 [Josh Rosen] Remove a bunch of overloaded methods to avoid default args. issue
      2bbac9c [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      35dad9f [Josh Rosen] Make sortAnswers = false the default in SparkPlanTest
      5135200 [Josh Rosen] Fix spill reading for large rows; add test
      2f48777 [Josh Rosen] Add test and fix bug for sorting empty arrays
      d1e28bc [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      cd05866 [Josh Rosen] Fix scalastyle
      3947fc1 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      d13ac55 [Josh Rosen] Hacky approach to copying of UnsafeRows for sort followed by limit.
      845bea3 [Josh Rosen] Remove unnecessary zeroing of row conversion buffer
      c56ec18 [Josh Rosen] Clean up final row copying code.
      d31f180 [Josh Rosen] Re-enable NullType sorting test now that SPARK-8868 is fixed
      844f4ca [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      293f109 [Josh Rosen] Add missing license header.
      f99a612 [Josh Rosen] Fix bugs in string prefix comparison.
      9d00afc [Josh Rosen] Clean up prefix comparators for integral types
      88aff18 [Josh Rosen] NULL_PREFIX has to be negative infinity for floating point types
      613e16f [Josh Rosen] Test with larger data.
      1d7ffaa [Josh Rosen] Somewhat hacky fix for descending sorts
      08701e7 [Josh Rosen] Fix prefix comparison of null primitives.
      b86e684 [Josh Rosen] Set global = true in UnsafeExternalSortSuite.
      1c7bad8 [Josh Rosen] Make sorting of answers explicit in SparkPlanTest.checkAnswer().
      b81a920 [Josh Rosen] Temporarily enable only the passing sort tests
      5d6109d [Josh Rosen] Fix inconsistent handling / encoding of record lengths.
      87b6ed9 [Josh Rosen] Fix critical issues in test which led to false negatives.
      8d7fbe7 [Josh Rosen] Fixes to multiple spilling-related bugs.
      82e21c1 [Josh Rosen] Force spilling in UnsafeExternalSortSuite.
      88b72db [Josh Rosen] Test ascending and descending sort orders.
      f27be09 [Josh Rosen] Fix tests by binding attributes.
      0a79d39 [Josh Rosen] Revert "Undo part of a SparkPlanTest change in #7162 that broke my test."
      7c3c864 [Josh Rosen] Undo part of a SparkPlanTest change in #7162 that broke my test.
      9969c14 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      5822e6f [Josh Rosen] Fix test compilation issue
      939f824 [Josh Rosen] Remove code gen experiment.
      0dfe919 [Josh Rosen] Implement prefix sort for strings (albeit inefficiently).
      66a813e [Josh Rosen] Prefix comparators for float and double
      b310c88 [Josh Rosen] Integrate prefix comparators for Int and Long (others coming soon)
      95058d9 [Josh Rosen] Add missing SortPrefixUtils file
      4c37ba6 [Josh Rosen] Add tests for sorting on all primitive types.
      6890863 [Josh Rosen] Fix memory leak on empty inputs.
      d246e29 [Josh Rosen] Fix consideration of column types when choosing sort implementation.
      6b156fb [Josh Rosen] Some WIP work on prefix comparison.
      7f875f9 [Josh Rosen] Commit failing test demonstrating bug in handling objects in spills
      41b8881 [Josh Rosen] Get UnsafeInMemorySorterSuite to pass (WIP)
      90c2b6a [Josh Rosen] Update test name
      6d6a1e6 [Josh Rosen] Centralize logic for picking sort operator implementations
      9869ec2 [Josh Rosen] Clean up Exchange code a bit
      82bb0ec [Josh Rosen] Fix IntelliJ complaint due to negated if condition
      1db845a [Josh Rosen] Many more changes to harmonize with shuffle sorter
      ebf9eea [Josh Rosen] Harmonization with shuffle's unsafe sorter
      206bfa2 [Josh Rosen] Add some missing newlines at the ends of files
      26c8931 [Josh Rosen] Back out some Hive changes that aren't needed anymore
      62f0bb8 [Josh Rosen] Update to reflect SparkPlanTest changes
      21d7d93 [Josh Rosen] Back out of BlockObjectWriter change
      7eafecf [Josh Rosen] Port test to SparkPlanTest
      d468a88 [Josh Rosen] Update for InternalRow refactoring
      269cf86 [Josh Rosen] Back out SMJ operator change; isolate changes to selection of sort op.
      1b841ca [Josh Rosen] WIP towards copying
      b420a71 [Josh Rosen] Move most of the existing SMJ code into Java.
      dfdb93f [Josh Rosen] SparkFunSuite change
      73cc761 [Josh Rosen] Fix whitespace
      9cc98f5 [Josh Rosen] Move more code to Java; fix bugs in UnsafeRowConverter length type.
      c8792de [Josh Rosen] Remove some debug logging
      dda6752 [Josh Rosen] Commit some missing code from an old git stash.
      58f36d0 [Josh Rosen] Merge in a sketch of a unit test for the new sorter (now failing).
      2bd8c9a [Josh Rosen] Import my original tests and get them to pass.
      d5d3106 [Josh Rosen] WIP towards external sorter for Spark SQL.
      fb8807c9
    • rahulpalamuttam's avatar
      [SPARK-8923] [DOCUMENTATION, MLLIB] Add @since tags to mllib.fpm · 0772026c
      rahulpalamuttam authored
      Author: rahulpalamuttam <rahulpalamut@gmail.com>
      
      Closes #7341 from rahulpalamuttam/TaggingMLlibfpm and squashes the following commits:
      
      bef2843 [rahulpalamuttam] fix @since tags in mmlib.fpm
      cd86252 [rahulpalamuttam] Add @since tags to mllib.fpm
      0772026c
    • Davies Liu's avatar
      [HOTFIX] fix flaky test in PySpark SQL · 05ac023d
      Davies Liu authored
It may lose microsecond precision when using a float for it.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7344 from davies/fix_date_test and squashes the following commits:
      
      249ec61 [Davies Liu] fix flaky test
      05ac023d
    • Min Zhou's avatar
      [SPARK-8675] Executors created by LocalBackend won't get the same classpath as... · c185f3a4
      Min Zhou authored
      [SPARK-8675] Executors created by LocalBackend won't get the same classpath as other executor backends
      
AFAIK, some Spark applications always use LocalBackend to do some local initialization; Spark SQL is an example. Starting a LocalBackend won't add the user classpath to the executor.
```scala
        override def start() {
          localEndpoint = SparkEnv.get.rpcEnv.setupEndpoint(
            "LocalBackendEndpoint", new LocalEndpoint(SparkEnv.get.rpcEnv, scheduler, this, totalCores))
        }
      ```
This will cause the local executor to fail in scenarios such as loading Hadoop's built-in native libraries, loading other user-defined native libraries, loading user jars, and reading S3 config from a site.xml file.
      
      Author: Min Zhou <coderplay@gmail.com>
      
      Closes #7091 from coderplay/master and squashes the following commits:
      
      365838f [Min Zhou] Fixed java.net.MalformedURLException, add default scheme, support relative path
      d215b7f [Min Zhou] Follows spark standard scala style, make the auto testing happy
      84ad2cd [Min Zhou] Use system specific path separator instead of ','
      01f5d1a [Min Zhou] Merge branch 'master' of https://github.com/apache/spark
      e528be7 [Min Zhou] Merge branch 'master' of https://github.com/apache/spark
      45bf62c [Min Zhou] SPARK-8675 Executors created by LocalBackend won't get the same classpath as other executor backends
      c185f3a4
    • Cheng Hao's avatar
      [CORE] [MINOR] change the log level to info · db6d57f8
      Cheng Hao authored
Too many logs are printed even when the log level is set to warning.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #7340 from chenghao-intel/log and squashes the following commits:
      
      59658cf [Cheng Hao] change the log level to info
      db6d57f8
    • Andrew Or's avatar
      [SPARK-8958] Dynamic allocation: change cached timeout to infinity · 5dd45bde
      Andrew Or authored
      pwendell and I discussed this a little more offline and concluded that it would be good to keep it more conservative. Losing cached blocks may be very expensive and we should only allow it if the user knows what he/she is doing.
      
      FYI harishreedharan sryza.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7329 from andrewor14/da-cached-timeout and squashes the following commits:
      
      cef0b4e [Andrew Or] Change timeout to infinity
      5dd45bde
    • Iulian Dragos's avatar
      [SPARK-7944] [SPARK-8013] Remove most of the Spark REPL fork for Scala 2.11 · 11e22b74
      Iulian Dragos authored
      This PR removes most of the code in the Spark REPL for Scala 2.11 and leaves just a couple of overridden methods in `SparkILoop` in order to:
      
      - change welcome message
      - restrict available commands (like `:power`)
      - initialize Spark context
      
      The two codebases have diverged and it's extremely hard to backport fixes from the upstream REPL. This somewhat radical step is absolutely necessary in order to fix other REPL tickets (like SPARK-8013 - Hive Thrift server for 2.11). BTW, the Scala REPL has fixed the serialization-unfriendly wrappers thanks to ScrapCodes's work in [#4522](https://github.com/scala/scala/pull/4522)
      
All tests pass, and I tried the `spark-shell` on our Mesos cluster with some simple jobs (including ones with additional jars); everything looked good.
      
      As soon as Scala 2.11.7 is out we need to upgrade and get a shaded `jline` dependency, clearing the way for SPARK-8013.
      
      /cc pwendell
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #6903 from dragos/issue/no-spark-repl-fork and squashes the following commits:
      
      c596c6f [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork
      2b1a305 [Iulian Dragos] Removed spaces around multiple imports.
      0ce67a6 [Iulian Dragos] Remove -verbose flag for java compiler (added by mistake in an earlier commit).
      10edaf9 [Iulian Dragos] Keep the jline dependency only in the 2.10 build.
529293b [Iulian Dragos] Add back Spark REPL files to rat-excludes, since they are part of the 2.10 repl.
      d85370d [Iulian Dragos] Remove jline dependency from the Spark REPL.
      b541930 [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork
      2b15962 [Iulian Dragos] Change jline dependency and bump Scala version.
      b300183 [Iulian Dragos] Rename package and add license on top of the file, remove files from rat-excludes and removed `-Yrepl-sync` per reviewer’s request.
      9d46d85 [Iulian Dragos] Fix SPARK-7944.
      abcc7cb [Iulian Dragos] Remove the REPL forked code.
      11e22b74
    • Jonathan Alter's avatar
      [SPARK-7977] [BUILD] Disallowing println · e14b545d
      Jonathan Alter authored
      Author: Jonathan Alter <jonalter@users.noreply.github.com>
      
      Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:
      
      ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
      7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
      10724b6 [Jonathan Alter] Changing some printlns to logs in tests
      eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0b1dcb4 [Jonathan Alter] More println cleanup
      aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0c16fa3 [Jonathan Alter] Replacing some printlns with logs
      45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      5c8e283 [Jonathan Alter] Allowing println in audit-release examples
      5b50da1 [Jonathan Alter] Allowing printlns in example files
      ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      83ab635 [Jonathan Alter] Fixing new printlns
      54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
      b837c3a [Jonathan Alter] Disallowing println
      e14b545d
  4. Jul 09, 2015
    • Michael Vogiatzis's avatar
      [DOCS] Added important updateStateByKey details · d538919c
      Michael Vogiatzis authored
The update function runs for *all* existing keys, and returning "None" removes the key-value pair.
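
A minimal sketch of that contract (illustrative; assumes a `DStream[(String, Int)]` and a configured checkpoint directory, which stateful operations require):

```scala
import org.apache.spark.streaming.dstream.DStream

def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    val total = state.getOrElse(0) + newValues.sum
    // The function is called for *every* tracked key each batch, even when no
    // new values arrived; returning None removes the key-value pair from state.
    if (total == 0) None else Some(total)
  }
```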
      
      Author: Michael Vogiatzis <michaelvogiatzis@gmail.com>
      
      Closes #7229 from mvogiatzis/patch-1 and squashes the following commits:
      
      e7a2946 [Michael Vogiatzis] Updated updateStateByKey text
      00283ed [Michael Vogiatzis] Removed space
      c2656f9 [Michael Vogiatzis] Moved description farther up
      0a42551 [Michael Vogiatzis] Added important updateStateByKey details
      d538919c
    • huangzhaowei's avatar
      [SPARK-8839] [SQL] ThriftServer2 will remove session and execution no matter it's finished or not. · 1903641e
      huangzhaowei authored
In my test, the numbers of `sessions` and `executions` in ThriftServer2 do not match the number of connections.
For example, with 200 clients connected to the server, there can be more than 200 `sessions` and `executions`.
So once the count reaches `retainedStatements`, the server has to remove objects that are not yet finished,
which may cause the exception described in the [Jira Address](https://issues.apache.org/jira/browse/SPARK-8839)
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #7239 from SaintBacchus/SPARK-8839 and squashes the following commits:
      
cf7ef40 [huangzhaowei] Remove a meaningless function call
      3e9a5a6 [huangzhaowei] Add a filter before take
      9d5ceb8 [huangzhaowei] [SPARK-8839][SQL]ThriftServer2 will remove session and execution no matter it's finished or not.
      1903641e
    • Holden Karau's avatar
[SPARK-8913] [ML] Simplify LogisticRegression suite to use Vector <-> Vector comparison · 27273046
      Holden Karau authored
      Cleanup tests from SPARK 8700.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7335 from holdenk/SPARK-8913-cleanup-tests-from-SPARK-8700-logistic-regression-r2-really-logistic-regression-this-time and squashes the following commits:
      
e5e2c5f [Holden Karau] Simplify LogisticRegression suite to use Vector <-> Vector comparisons instead of comparing element by element
      27273046
    • Marcelo Vanzin's avatar
      [SPARK-8852] [FLUME] Trim dependencies in flume assembly. · 0e78e40c
      Marcelo Vanzin authored
      Also, add support for the *-provided profiles. This avoids repackaging
      things that are already in the Spark assembly, or, in the case of the
      *-provided profiles, are provided by the distribution.
      
      The flume-ng-auth dependency was also excluded since it's not really
      used by Spark.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7247 from vanzin/SPARK-8852 and squashes the following commits:
      
      298a7d5 [Marcelo Vanzin] Feedback.
      c962082 [Marcelo Vanzin] [SPARK-8852] [flume] Trim dependencies in flume assembly.
      0e78e40c
    • Cheng Lian's avatar
      [SPARK-8959] [SQL] [HOTFIX] Removes parquet-thrift and libthrift dependencies · 2d45571f
      Cheng Lian authored
      These two dependencies were introduced in #7231 to help testing Parquet compatibility with `parquet-thrift`. However, they somehow crash the Scala compiler in Maven builds.
      
      This PR fixes this issue by:
      
      1. Removing these two dependencies, and
      2. Instead of generating the testing Parquet file programmatically, checking in an actual testing Parquet file generated by `parquet-thrift` as a test resource.
      
This is just a quick fix to bring back Maven builds. We need to figure out the root cause, as binary Parquet files are harder to maintain.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7330 from liancheng/spark-8959 and squashes the following commits:
      
      cf69512 [Cheng Lian] Brings back Maven builds
      2d45571f
    • Feynman Liang's avatar
      [SPARK-8538] [SPARK-8539] [ML] Linear Regression Training and Testing Results · a0cc3e5a
      Feynman Liang authored
      Adds results (e.g. objective value at each iteration, residuals) on training and user-specified test sets for LinearRegressionModel.
      
      Notes to Reviewers:
* Are the `*TrainingResults` and `Results` classes too specialized for `LinearRegressionModel`? What would be an appropriate level of abstraction?
       * Please check `transient` annotations are correct; the datasets should not be copied and kept during serialization.
       * Any thoughts on `RDD`s versus `DataFrame`s? If using `DataFrame`s, suggested schemas for each intermediate step? Also, how to create a "local DataFrame" without a `sqlContext`?
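
A hedged usage sketch of the resulting API (names follow the eventual spark.ml summary API and may differ slightly in this commit; `training` and `test` are assumed DataFrames with "features" and "label" columns):

```scala
import org.apache.spark.ml.regression.LinearRegression

val model = new LinearRegression().setMaxIter(50).fit(training)

val trainSummary = model.summary                        // results on the training set
println(trainSummary.objectiveHistory.mkString(", "))   // objective value per iteration

val testSummary = model.evaluate(test)                  // results on a user-specified test set
println(testSummary.rootMeanSquaredError)
```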
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7099 from feynmanliang/SPARK-8538 and squashes the following commits:
      
      d219fa4 [Feynman Liang] Update docs
      4a42680 [Feynman Liang] Change Summary to hold values, move transient annotations down to metrics and predictions DF
      6300031 [Feynman Liang] Code review changes
      0a5e762 [Feynman Liang] Fix build error
      e71102d [Feynman Liang] Merge branch 'master' into SPARK-8538
      3367489 [Feynman Liang] Merge branch 'master' into SPARK-8538
      70f267c [Feynman Liang] Make TrainingSummary transient and remove Serializable from *Summary and RegressionMetrics
      1d9ea42 [Feynman Liang] Fix failing Java test
      a65dfda [Feynman Liang] Make TrainingSummary and metrics serializable, prediction dataframe transient
      0a605d8 [Feynman Liang] Replace Params from LinearRegression*Summary with private constructor vals
      c2fe835 [Feynman Liang] Optimize imports
      02d8a70 [Feynman Liang] Add Params to LinearModel*Summary, refactor tests and add test for evaluate()
      8f999f4 [Feynman Liang] Refactor from jkbradley code review
      072e948 [Feynman Liang] Style
      509ae36 [Feynman Liang] Use DFs and localize serialization to LinearRegressionModel
      9509c79 [Feynman Liang] Fix imports
      b2bbaa3 [Feynman Liang] Refactored LinearRegressionResults API to be more private
      ffceaec [Feynman Liang] Merge branch 'master' into SPARK-8538
      1cedb2b [Feynman Liang] Add test for decreasing objective trace
      dab0aff [Feynman Liang] Add LinearRegressionTrainingResults tests, make test suite code copy+pasteable
      97b0a81 [Feynman Liang] Add LinearRegressionModel.evaluate() to get results on test sets
      dc51bce [Feynman Liang] Style guide fixes
      521f397 [Feynman Liang] Use RDD[(Double, Double)] instead of DF
      2ff5710 [Feynman Liang] Add training results and model summary to ML LinearRegression
      a0cc3e5a
    • Holden Karau's avatar
      [SPARK-8963][ML] cleanup tests in linear regression suite · e29ce319
      Holden Karau authored
Simplify model weight assertions to use vector comparison; switch to using absTol when comparing with 0.0 intercepts
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7327 from holdenk/SPARK-8913-cleanup-tests-from-SPARK-8700-logistic-regression and squashes the following commits:
      
5bac185 [Holden Karau] Simplify model weight assertions to use vector comparison, switch to using absTol when comparing with 0.0 intercepts
      e29ce319
    • Xiangrui Meng's avatar
      Closes #6837 · 69165330
      Xiangrui Meng authored
      Closes #7321
      Closes #2634
      Closes #4963
      Closes #2137
      69165330
    • guowei2's avatar
      [SPARK-8865] [STREAMING] FIX BUG: check key in kafka params · 89770036
      guowei2 authored
      Author: guowei2 <guowei@growingio.com>
      
      Closes #7254 from guowei2/spark-8865 and squashes the following commits:
      
      48ca17a [guowei2] fix contains key
      89770036
    • Davies Liu's avatar
      [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of... · c9e2ef52
      Davies Liu authored
      [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of serialization for Python DataFrame
      
This PR fixes the long-standing serialization issue between Python RDDs and DataFrames. It switches to a customized Pickler for InternalRow to enable customized unpickling (type conversion, especially for UDTs), so now we can support UDTs for UDFs. cc mengxr.
      
      There is no generated `Row` anymore.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7301 from davies/sql_ser and squashes the following commits:
      
      81bef71 [Davies Liu] address comments
      e9217bd [Davies Liu] add regression tests
      db34167 [Davies Liu] Refactor of serialization for Python DataFrame
      c9e2ef52
    • jerryshao's avatar
      [SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Python · 3ccebf36
      jerryshao authored
This PR proposes a simple way to expose OffsetRange in Python code. The usage of offsetRanges is similar to the Scala/Java way; in Python we can get an OffsetRange like:
      
      ```
      dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
      ```
      
The reason I didn't follow what SPARK-8389 suggested is that the Python Kafka API has one more step to decode the message compared to Scala/Java, which makes the Python API return a transformed RDD/DStream rather than a directly wrapped JavaKafkaRDD, so it is hard to backtrack to the original RDD to get the offsetRange.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:
      
      4c6d320 [jerryshao] Another way to fix subclass deserialization issue
      e6a8011 [jerryshao] Address the comments
      fd13937 [jerryshao] Fix serialization bug
      7debf1c [jerryshao] bug fix
      cff3893 [jerryshao] refactor the code according to the comments
      2aabf9e [jerryshao] Style fix
      848c708 [jerryshao] Add HasOffsetRanges for Python
      3ccebf36
    • zsxwing's avatar
      [SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch page · 1f6b0b12
      zsxwing authored
      This PR adds `metadata` to `InputInfo`. `InputDStream` can report its metadata for a batch and it will be shown in the batch page.
      
      For example,
      
      ![screen shot](https://cloud.githubusercontent.com/assets/1000778/8403741/d6ffc7e2-1e79-11e5-9888-c78c1575123a.png)
      
      FileInputDStream will display the new files for a batch, and DirectKafkaInputDStream will display its offset ranges.
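
For illustration (hedged; field and key names are taken from the rename in the commits below, so the exact shape may differ), an input stream reports its per-batch metadata roughly like this:

```scala
import org.apache.spark.streaming.scheduler.StreamInputInfo

// A human-readable description stored under METADATA_KEY_DESCRIPTION is what
// the batch page renders; other keys can carry structured values.
val inputInfo = StreamInputInfo(
  inputStreamId = 0,
  numRecords = 1000L,
  metadata = Map(StreamInputInfo.METADATA_KEY_DESCRIPTION -> "offsets: [0, 1000)"))
```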
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7081 from zsxwing/input-metadata and squashes the following commits:
      
      f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
      d906209 [zsxwing] Merge branch 'master' into input-metadata
      74762da [zsxwing] Fix MiMa tests
      7903e33 [zsxwing] Merge branch 'master' into input-metadata
      450a46c [zsxwing] Address comments
1d94582 [zsxwing] Rename InputInfo to StreamInputInfo and change "metadata" to Map[String, Any]
      d496ae9 [zsxwing] Add input metadata in the batch page
      1f6b0b12
    • Iulian Dragos's avatar
      [SPARK-6287] [MESOS] Add dynamic allocation to the coarse-grained Mesos scheduler · c4830598
      Iulian Dragos authored
      This is largely based on extracting the dynamic allocation parts from tnachen's #3861.
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #4984 from dragos/issue/mesos-coarse-dynamicAllocation and squashes the following commits:
      
      39df8cd [Iulian Dragos] Update tests to latest changes in core.
      9d2c9fa [Iulian Dragos] Remove adjustment of executorLimitOption in doKillExecutors.
      8b00f52 [Iulian Dragos] Latest round of reviews.
      0cd00e0 [Iulian Dragos] Add persistent shuffle directory
      15c45c1 [Iulian Dragos] Add dynamic allocation to the Spark coarse-grained scheduler.
      c4830598
    • Andrew Or's avatar
      [SPARK-2017] [UI] Stage page hangs with many tasks · ebdf5853
      Andrew Or authored
      (This reopens a patch that was closed in the past: #6248)
      
      When you view the stage page while running the following:
      ```
      sc.parallelize(1 to X, 10000).count()
      ```
      The page never loads, the job is stalled, and you end up running into an OOM:
      ```
      HTTP ERROR 500
      
      Problem accessing /stages/stage/. Reason:
          Server Error
      Caused by:
      java.lang.OutOfMemoryError: Java heap space
          at java.util.Arrays.copyOf(Arrays.java:2367)
          at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
      ```
      This patch compresses Jetty responses in gzip. The correct long-term fix is to add pagination.
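
A hedged sketch of the approach (not Spark's actual `JettyUtils` code; `GzipHandler`'s package is as in Jetty 8.x and moved in later Jetty versions):

```scala
import org.eclipse.jetty.server.Handler
import org.eclipse.jetty.server.handler.GzipHandler

// Wrap an existing handler so its responses are gzip-compressed, shrinking the
// very large HTML generated for stages with many tasks.
def gzipped(handler: Handler): GzipHandler = {
  val gzip = new GzipHandler()
  gzip.setHandler(handler)
  gzip
}
```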
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7296 from andrewor14/gzip-jetty and squashes the following commits:
      
      a051c64 [Andrew Or] Use GZIP to compress Jetty responses
      ebdf5853
    • zsxwing's avatar
      [SPARK-7419] [STREAMING] [TESTS] Fix CheckpointSuite.recovery with file input stream · 88bf4303
      zsxwing authored
      Fix this failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2886/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_with_file_input_stream/
      
      To reproduce this failure, you can add `Thread.sleep(2000)` before this line
      https://github.com/apache/spark/blob/a9c4e29950a14e32acaac547e9a0e8879fd37fc9/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala#L477
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7323 from zsxwing/SPARK-7419 and squashes the following commits:
      
      b3caf58 [zsxwing] Fix CheckpointSuite.recovery with file input stream
      88bf4303
    • xutingjun's avatar
      [SPARK-8953] SPARK_EXECUTOR_CORES is not read in SparkSubmit · 930fe953
      xutingjun authored
The configuration `SPARK_EXECUTOR_CORES` isn't put into `SparkConf`, so it has no effect on dynamic executor allocation.
      
      Author: xutingjun <xutingjun@huawei.com>
      
      Closes #7322 from XuTingjun/SPARK_EXECUTOR_CORES and squashes the following commits:
      
      2cafa89 [xutingjun] make SPARK_EXECUTOR_CORES has effect to dynamicAllocation
      930fe953
    • Tathagata Das's avatar
      [MINOR] [STREAMING] Fix log statements in ReceiverSupervisorImpl · 7ce3b818
      Tathagata Das authored
Log statements incorrectly showed that the executor was being stopped when the receiver was being stopped.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7328 from tdas/fix-log and squashes the following commits:
      
      9cc6e99 [Tathagata Das] Fix log statements.
      7ce3b818
    • Cheng Hao's avatar
      [SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257] [SPARK-8258]... · 0b0b9cea
      Cheng Hao authored
      [SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257] [SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260] [SPARK-8267] [SQL] Add String Expressions
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6762 from chenghao-intel/str_funcs and squashes the following commits:
      
      b09a909 [Cheng Hao] update the code as feedback
      7ebbf4c [Cheng Hao] Add more string expressions
      0b0b9cea
    • Yuhao Yang's avatar
      [SPARK-8703] [ML] Add CountVectorizer as a ml transformer to convert document to words count vector · 0cd84c86
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-8703
      
      Converts a text document to a sparse vector of token counts.
      
I can further add an estimator to extract the vocabulary from the corpus if that's appropriate.
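
A hedged usage sketch (class and column names follow the eventual spark.ml API and may differ in this commit; `docs` is an assumed DataFrame with a tokenized "words" column):

```scala
import org.apache.spark.ml.feature.CountVectorizerModel

// Build a model over a fixed vocabulary; each output vector holds the count of
// each vocabulary term in the document's "words" array.
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")
val vectorized = cvm.transform(docs)  // sparse token-count vectors
```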
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #7084 from hhbyyh/countVectorization and squashes the following commits:
      
      5f3f655 [Yuhao Yang] text change
      24728e4 [Yuhao Yang] style improvement
      576728a [Yuhao Yang] rename to model and some fix
      1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
      99b0c14 [Yuhao Yang] undo extension from HashingTF
      12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
      7ee1c31 [Yuhao Yang] extends HashingTF
      809fb59 [Yuhao Yang] minor fix for ut
      7c61fb3 [Yuhao Yang] add countVectorizer
      0cd84c86
    • JPark's avatar
      [SPARK-8863] [EC2] Check aws access key from aws credentials if there is no boto config · c59e268d
      JPark authored
'spark_ec2.py' uses boto to control EC2, and boto supports '~/.aws/credentials', which is the AWS CLI's default configuration file.

We can check this in the boto reference:
      
      "A boto config file is a text file formatted like an .ini configuration file that specifies values for options that control the behavior of the boto library. In Unix/Linux systems, on startup, the boto library looks for configuration files in the following locations and in the following order:
      /etc/boto.cfg - for site-wide settings that all users on this machine will use
      (if profile is given) ~/.aws/credentials - for credentials shared between SDKs
      (if profile is given) ~/.boto - for user-specific settings
      ~/.aws/credentials - for credentials shared between SDKs
      ~/.boto - for user-specific settings"
      
      * ref of boto: http://boto.readthedocs.org/en/latest/boto_config_tut.html
      * ref of aws cli : http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
      
However, 'spark_ec2.py' only checks the boto config and environment variables; even if '~/.aws/credentials' exists, 'spark_ec2.py' terminates.

So I changed it to also check '~/.aws/credentials'.
      
      cc rxin
      
      Jira : https://issues.apache.org/jira/browse/SPARK-8863
      
      Author: JPark <JPark@JPark.me>
      
      Closes #7252 from JuhongPark/master and squashes the following commits:
      
      23c5792 [JPark] Check aws access key from aws credentials if there is no boto config
      c59e268d
    • Wenchen Fan's avatar
      [SPARK-8938][SQL] Implement toString for Interval data type · f6c0bd5c
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7315 from cloud-fan/toString and squashes the following commits:
      
      4fc8d80 [Wenchen Fan] Implement toString for Interval data type
      f6c0bd5c
    • Reynold Xin's avatar
      [SPARK-8926][SQL] Code review followup. · a870a82f
      Reynold Xin authored
      I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7313 from rxin/adt and squashes the following commits:
      
      7ade82b [Reynold Xin] Fixed unit tests.
      f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
      a870a82f
    • Reynold Xin's avatar
      [SPARK-8948][SQL] Remove ExtractValueWithOrdinal abstract class · e204d22b
      Reynold Xin authored
      Also added more documentation for the file.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7316 from rxin/extract-value and squashes the following commits:
      
      069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal.
      621b705 [Reynold Xin] Reverted a line.
      11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type extractors.
      e204d22b