  1. May 13, 2015
    • Josh Rosen's avatar
      [SPARK-7081] Faster sort-based shuffle path using binary processing cache-aware sort · 73bed408
      Josh Rosen authored
      This patch introduces a new shuffle manager that enhances the existing sort-based shuffle with a new cache-friendly sort algorithm that operates directly on binary data. The goals of this patch are to lower memory usage and Java object overheads during shuffle and to speed up sorting. It also lays groundwork for follow-up patches that will enable end-to-end processing of serialized records.
      
      The new shuffle manager, `UnsafeShuffleManager`, can be enabled by setting `spark.shuffle.manager=tungsten-sort` in SparkConf.
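      As a configuration fragment (sketched here for a PySpark application; only the `spark.shuffle.manager` key is taken from the text above):

      ```python
      from pyspark import SparkConf, SparkContext

      # Opt in to the new shuffle path; Spark falls back to sort-based
      # shuffle automatically when the optimizations cannot be applied.
      conf = SparkConf().set("spark.shuffle.manager", "tungsten-sort")
      sc = SparkContext(conf=conf)
      ```
      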
      
      The new shuffle manager uses directly-managed memory to implement several performance optimizations for certain types of shuffles. In cases where the new performance optimizations cannot be applied, the new shuffle manager delegates to SortShuffleManager to handle those shuffles.
      
      UnsafeShuffleManager's optimizations will apply when _all_ of the following conditions hold:
      
       - The shuffle dependency specifies no aggregation or output ordering.
       - The shuffle serializer supports relocation of serialized values (this is currently supported
         by KryoSerializer and Spark SQL's custom serializers).
       - The shuffle produces fewer than 16777216 (2^24) output partitions.
       - No individual record is larger than 128 MB when serialized.
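      The conditions above can be sketched as a small predicate (a hypothetical helper; the real decision logic lives inside the shuffle manager, and the names below are illustrative). The per-record 128 MB limit is checked at insert time rather than up front:

      ```python
      MAX_SHUFFLE_OUTPUT_PARTITIONS = 1 << 24  # 16777216
      MAX_RECORD_SIZE = 128 * 1024 * 1024      # 128 MB, enforced per record

      def can_use_unsafe_shuffle(has_aggregation, has_key_ordering,
                                 serializer_supports_relocation, num_partitions):
          """Illustrative predicate mirroring the conditions listed above."""
          return (not has_aggregation
                  and not has_key_ordering
                  and serializer_supports_relocation
                  and num_partitions < MAX_SHUFFLE_OUTPUT_PARTITIONS)
      ```
      
      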
      
      In addition, extra spill-merging optimizations are automatically applied when the shuffle compression codec supports concatenation of compressed streams. Among Spark's built-in codecs, this is currently supported by LZF.
      
      At a high level, UnsafeShuffleManager's design is similar to Spark's existing SortShuffleManager. In sort-based shuffle, incoming records are sorted according to their target partition ids, then written to a single map output file. Reducers fetch contiguous regions of this file in order to read their portion of the map output. In cases where the map output data is too large to fit in memory, sorted subsets of the output are spilled to disk, and those on-disk files are merged to produce the final output file.
      
      UnsafeShuffleManager optimizes this process in several ways:
      
       - Its sort operates on serialized binary data rather than Java objects, which reduces memory consumption and GC overheads. This optimization requires the record serializer to have certain properties to allow serialized records to be re-ordered without requiring deserialization.  See SPARK-4550, where this optimization was first proposed and implemented, for more details.
      
       - It uses a specialized cache-efficient sorter (UnsafeShuffleExternalSorter) that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, more of the array fits into CPU cache.
      
       - The spill merging procedure operates on blocks of serialized records that belong to the same partition and does not need to deserialize records during the merge.
      
       - When the spill compression codec supports concatenation of compressed data, the spill merge simply concatenates the serialized and compressed spill partitions to produce the final output partition.  This allows efficient data copying methods, like NIO's `transferTo`, to be used and avoids the need to allocate decompression or copying buffers during the merge.
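      The 8-byte sort entries from the second point above can be illustrated with a simple bit layout (the split below, 24 bits of partition id and 40 bits of record address, is an assumption chosen to match the 2^24-partition limit, not necessarily the exact encoding used):

      ```python
      PARTITION_BITS = 24                 # matches the 2^24 partition limit
      ADDRESS_BITS = 40
      ADDRESS_MASK = (1 << ADDRESS_BITS) - 1

      def pack(partition_id, address):
          """Pack a partition id and record address into one 64-bit value.
          The partition id occupies the high bits, so sorting the packed
          longs orders records by target partition."""
          assert 0 <= partition_id < (1 << PARTITION_BITS)
          assert 0 <= address <= ADDRESS_MASK
          return (partition_id << ADDRESS_BITS) | address

      def unpack(packed):
          return packed >> ADDRESS_BITS, packed & ADDRESS_MASK
      ```

      Because only these 8-byte values are moved during the sort, far more entries fit in cache than if full records or object references were shuffled around.
      
      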
      
      The shuffle read path is unchanged.
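      The concatenation-based spill merge described above can be sketched as plain byte-range copying (illustrative only: per-partition offsets and file handling are simplified, and the real implementation uses NIO's `transferTo` rather than buffered reads):

      ```python
      def merge_spills(spill_paths, partition_ranges, out_path, num_partitions):
          """Concatenate the compressed bytes of each partition across all
          spill files, without decompressing or deserializing anything.
          partition_ranges[i][p] is the (start, end) byte range of
          partition p inside spill file i."""
          with open(out_path, "wb") as out:
              for p in range(num_partitions):
                  for path, ranges in zip(spill_paths, partition_ranges):
                      start, end = ranges[p]
                      with open(path, "rb") as spill:
                          spill.seek(start)
                          out.write(spill.read(end - start))
      ```
      
      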
      
      This patch is similar to [SPARK-4550](http://issues.apache.org/jira/browse/SPARK-4550) / #4450 but uses a slightly different implementation. The `unsafe`-based implementation featured in this patch lays the groundwork for followup patches that will enable sorting to operate on serialized data pages that will be prepared by Spark SQL's new `unsafe` operators (such as the new aggregation operator introduced in #5725).
      
      ### Future work
      
      There are several tasks that build upon this patch, which will be left to future work:
      
      - [SPARK-7271](https://issues.apache.org/jira/browse/SPARK-7271) Redesign / extend the shuffle interfaces to accept binary data as input. The goal here is to let us bypass serialization steps in cases where the sort input is produced by an operator that operates directly on binary data.
      - Extension / redesign of the `Serializer` API. We can add new methods which allow serializers to determine the size requirements for serializing objects and for serializing objects directly to a specified memory address (similar to how `UnsafeRowConverter` works in Spark SQL).
      
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5868 from JoshRosen/unsafe-sort and squashes the following commits:
      
      ef0a86e [Josh Rosen] Fix scalastyle errors
      7610f2f [Josh Rosen] Add tests for proper cleanup of shuffle data.
      d494ffe [Josh Rosen] Fix deserialization of JavaSerializer instances.
      52a9981 [Josh Rosen] Fix some bugs in the address packing code.
      51812a7 [Josh Rosen] Change shuffle manager sort name to tungsten-sort
      4023fa4 [Josh Rosen] Add @Private annotation to some Java classes.
      de40b9d [Josh Rosen] More comments to try to explain metrics code
      df07699 [Josh Rosen] Attempt to clarify confusing metrics update code
      5e189c6 [Josh Rosen] Track time spent closing / flushing files; split TimeTrackingOutputStream into separate file.
      d5779c6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      c2ce78e [Josh Rosen] Fix a missed usage of MAX_PARTITION_ID
      e3b8855 [Josh Rosen] Cleanup in UnsafeShuffleWriter
      4a2c785 [Josh Rosen] rename 'sort buffer' to 'pointer array'
      6276168 [Josh Rosen] Remove ability to disable spilling in UnsafeShuffleExternalSorter.
      57312c9 [Josh Rosen] Clarify fileBufferSize units
      2d4e4f4 [Josh Rosen] Address some minor comments in UnsafeShuffleExternalSorter.
      fdcac08 [Josh Rosen] Guard against overflow when expanding sort buffer.
      85da63f [Josh Rosen] Cleanup in UnsafeShuffleSorterIterator.
      0ad34da [Josh Rosen] Fix off-by-one in nextInt() call
      56781a1 [Josh Rosen] Rename UnsafeShuffleSorter to UnsafeShuffleInMemorySorter
      e995d1a [Josh Rosen] Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
      e58a6b4 [Josh Rosen] Add more tests for PackedRecordPointer encoding.
      4f0b770 [Josh Rosen] Attempt to implement proper shuffle write metrics.
      d4e6d89 [Josh Rosen] Update to bit shifting constants
      69d5899 [Josh Rosen] Remove some unnecessary override vals
      8531286 [Josh Rosen] Add tests that automatically trigger spills.
      7c953f9 [Josh Rosen] Add test that covers UnsafeShuffleSortDataFormat.swap().
      e1855e5 [Josh Rosen] Fix a handful of misc. IntelliJ inspections
      39434f9 [Josh Rosen] Avoid integer multiplication overflow in getMemoryUsage (thanks FindBugs!)
      1e3ad52 [Josh Rosen] Delete unused ByteBufferOutputStream class.
      ea4f85f [Josh Rosen] Roll back an unnecessary change in Spillable.
      ae538dc [Josh Rosen] Document UnsafeShuffleManager.
      ec6d626 [Josh Rosen] Add notes on maximum # of supported shuffle partitions.
      0d4d199 [Josh Rosen] Bump up shuffle.memoryFraction to make tests pass.
      b3b1924 [Josh Rosen] Properly implement close() and flush() in DummySerializerInstance.
      1ef56c7 [Josh Rosen] Revise compression codec support in merger; test cross product of configurations.
      b57c17f [Josh Rosen] Disable some overly-verbose logs that rendered DEBUG useless.
      f780fb1 [Josh Rosen] Add test demonstrating which compression codecs support concatenation.
      4a01c45 [Josh Rosen] Remove unnecessary log message
      27b18b0 [Josh Rosen] Test for inserting records AT the max record size.
      fcd9a3c [Josh Rosen] Add notes + tests for maximum record / page sizes.
      9d1ee7c [Josh Rosen] Fix MiMa excludes for ShuffleWriter change
      fd4bb9e [Josh Rosen] Use own ByteBufferOutputStream rather than Kryo's
      67d25ba [Josh Rosen] Update Exchange operator's copying logic to account for new shuffle manager
      8f5061a [Josh Rosen] Strengthen assertion to check partitioning
      01afc74 [Josh Rosen] Actually read data in UnsafeShuffleWriterSuite
      1929a74 [Josh Rosen] Update to reflect upstream ShuffleBlockManager -> ShuffleBlockResolver rename.
      e8718dd [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      9b7ebed [Josh Rosen] More defensive programming RE: cleaning up spill files and memory after errors
      7cd013b [Josh Rosen] Begin refactoring to enable proper tests for spilling.
      722849b [Josh Rosen] Add workaround for transferTo() bug in merging code; refactor tests.
      9883e30 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      b95e642 [Josh Rosen] Refactor and document logic that decides when to spill.
      1ce1300 [Josh Rosen] More minor cleanup
      5e8cf75 [Josh Rosen] More minor cleanup
      e67f1ea [Josh Rosen] Remove upper type bound in ShuffleWriter interface.
      cfe0ec4 [Josh Rosen] Address a number of minor review comments:
      8a6fe52 [Josh Rosen] Rename UnsafeShuffleSpillWriter to UnsafeShuffleExternalSorter
      11feeb6 [Josh Rosen] Update TODOs related to shuffle write metrics.
      b674412 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      aaea17b [Josh Rosen] Add comments to UnsafeShuffleSpillWriter.
      4f70141 [Josh Rosen] Fix merging; now passes UnsafeShuffleSuite tests.
      133c8c9 [Josh Rosen] WIP towards testing UnsafeShuffleWriter.
      f480fb2 [Josh Rosen] WIP in mega-refactoring towards shuffle-specific sort.
      57f1ec0 [Josh Rosen] WIP towards packed record pointers for use in optimized shuffle sort.
      69232fd [Josh Rosen] Enable compressible address encoding for off-heap mode.
      7ee918e [Josh Rosen] Re-order imports in tests
      3aeaff7 [Josh Rosen] More refactoring and cleanup; begin cleaning iterator interfaces
      3490512 [Josh Rosen] Misc. cleanup
      f156a8f [Josh Rosen] Hacky metrics integration; refactor some interfaces.
      2776aca [Josh Rosen] First passing test for ExternalSorter.
      5e100b2 [Josh Rosen] Super-messy WIP on external sort
      595923a [Josh Rosen] Remove some unused variables.
      8958584 [Josh Rosen] Fix bug in calculating free space in current page.
      f17fa8f [Josh Rosen] Add missing newline
      c2fca17 [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
      b8a09fe [Josh Rosen] Back out accidental log4j.properties change
      bfc12d3 [Josh Rosen] Add tests for serializer relocation property.
      240864c [Josh Rosen] Remove PrefixComputer and require prefix to be specified as part of insert()
      1433b42 [Josh Rosen] Store record length as int instead of long.
      026b497 [Josh Rosen] Re-use a buffer in UnsafeShuffleWriter
      0748458 [Josh Rosen] Port UnsafeShuffleWriter to Java.
      87e721b [Josh Rosen] Renaming and comments
      d3cc310 [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
      e2d96ca [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
      e267cee [Josh Rosen] Fix compilation of UnsafeSorterSuite
      9c6cf58 [Josh Rosen] Refactor to use DiskBlockObjectWriter.
      253f13e [Josh Rosen] More cleanup
      8e3ec20 [Josh Rosen] Begin code cleanup.
      4d2f5e1 [Josh Rosen] WIP
      3db12de [Josh Rosen] Minor simplification and sanity checks in UnsafeSorter
      767d3ca [Josh Rosen] Fix invalid range in UnsafeSorter.
      e900152 [Josh Rosen] Add test for empty iterator in UnsafeSorter
      57a4ea0 [Josh Rosen] Make initialSize configurable in UnsafeSorter
      abf7bfe [Josh Rosen] Add basic test case.
      81d52c5 [Josh Rosen] WIP on UnsafeSorter
      73bed408
    • Hari Shreedharan's avatar
      [SPARK-7356] [STREAMING] Fix flakey tests in FlumePollingStreamSuite using... · 61d1e87c
      Hari Shreedharan authored
      [SPARK-7356] [STREAMING] Fix flakey tests in FlumePollingStreamSuite using SparkSink's batch CountDownLatch.
      
      This is meant to make the FlumePollingStreamSuite deterministic: we now count the number of batches that have completed and then verify the results, rather than sleeping for arbitrary periods of time.
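      The counting approach can be sketched in Python (SparkSink's actual primitive is Java's `CountDownLatch`; this stand-in just illustrates the idea):

      ```python
      import threading

      class BatchLatch:
          """CountDownLatch-style helper: tests wait until N batches have
          completed instead of sleeping for an arbitrary period."""
          def __init__(self, expected_batches):
              self._remaining = expected_batches
              self._cond = threading.Condition()

          def batch_completed(self):
              with self._cond:
                  self._remaining -= 1
                  if self._remaining <= 0:
                      self._cond.notify_all()

          def await_completion(self, timeout=None):
              with self._cond:
                  return self._cond.wait_for(lambda: self._remaining <= 0, timeout)
      ```
      
      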
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #5918 from harishreedharan/flume-test-fix and squashes the following commits:
      
      93f24f3 [Hari Shreedharan] Add an eventually block to ensure that all received data is processed. Refactor the dstream creation and remove redundant code.
      1108804 [Hari Shreedharan] [SPARK-7356][STREAMING] Fix flakey tests in FlumePollingStreamSuite using SparkSink's batch CountDownLatch.
      61d1e87c
    • Andrew Or's avatar
      [STREAMING] [MINOR] Keep streaming.UIUtils private · bb6dec3b
      Andrew Or authored
      zsxwing
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6134 from andrewor14/private-streaming-uiutils and squashes the following commits:
      
      225df94 [Andrew Or] Privatize class
      bb6dec3b
    • Andrew Or's avatar
      [SPARK-7502] DAG visualization: gracefully handle removed stages · aa183787
      Andrew Or authored
      Old stages are removed without much feedback to the user. This happens very often in streaming. See screenshots below for more detail. zsxwing
      
      **Before**
      
      <img src="https://cloud.githubusercontent.com/assets/2133137/7621031/643cc1e0-f978-11e4-8f42-09decaac44a7.png" width="500px"/>
      
      -------------------------
      **After**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7621037/6e37348c-f978-11e4-84a5-e44e154f9b13.png" width="400px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6132 from andrewor14/dag-viz-remove-gracefully and squashes the following commits:
      
      43175cd [Andrew Or] Handle removed jobs and stages gracefully
      aa183787
    • Andrew Or's avatar
      [SPARK-7464] DAG visualization: highlight the same RDDs on hover · 44403414
      Andrew Or authored
      This is pretty useful for MLlib.
      
      <img src="https://cloud.githubusercontent.com/assets/2133137/7599650/c7d03dd8-f8b8-11e4-8c0a-0a89e786c90f.png" width="400px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6100 from andrewor14/dag-viz-hover and squashes the following commits:
      
      fefe2af [Andrew Or] Link tooltips for nodes that belong to the same RDD
      90c6a7e [Andrew Or] Assign classes to clusters and nodes, not IDs
      44403414
    • Andrew Or's avatar
      [SPARK-7399] Spark compilation error for scala 2.11 · f88ac701
      Andrew Or authored
      Subsequent fix following #5966. I tried this out locally.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6129 from andrewor14/211-compilation and squashes the following commits:
      
      713868f [Andrew Or] Fix compilation issue for scala 2.11
      f88ac701
    • Andrew Or's avatar
      [SPARK-7608] Clean up old state in RDDOperationGraphListener · f6e18388
      Andrew Or authored
      This is necessary for streaming and long-running Spark applications. zsxwing tdas
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6125 from andrewor14/viz-listener-leak and squashes the following commits:
      
      8660949 [Andrew Or] Fix thing + add tests
      33c0843 [Andrew Or] Clean up old job state
      f6e18388
    • Reynold Xin's avatar
      [SQL] Move some classes into packages that are more appropriate. · e683182c
      Reynold Xin authored
      JavaTypeInference into catalyst
      types.DateUtils into catalyst
      CacheManager into execution
      DefaultParserDialect into catalyst
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6108 from rxin/sql-rename and squashes the following commits:
      
      3fc9613 [Reynold Xin] Fixed import ordering.
      83d9ff4 [Reynold Xin] Fixed codegen tests.
      e271e86 [Reynold Xin] mima
      f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
      e683182c
    • scwf's avatar
      [SPARK-7303] [SQL] push down project if possible when the child is sort · 59250fe5
      scwf authored
      Optimize the case of `project(_, sort)` , a example is:
      
      `select key from (select * from testData order by key) t`
      
      before this PR:
      ```
      == Parsed Logical Plan ==
      'Project ['key]
       'Subquery t
        'Sort ['key ASC], true
         'Project [*]
          'UnresolvedRelation [testData], None
      
      == Analyzed Logical Plan ==
      Project [key#0]
       Subquery t
        Sort [key#0 ASC], true
         Project [key#0,value#1]
          Subquery testData
           LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Optimized Logical Plan ==
      Project [key#0]
       Sort [key#0 ASC], true
        LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Physical Plan ==
      Project [key#0]
       Sort [key#0 ASC], true
        Exchange (RangePartitioning [key#0 ASC], 5), []
         PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
      ```
      
      after this PR
      ```
      == Parsed Logical Plan ==
      'Project ['key]
       'Subquery t
        'Sort ['key ASC], true
         'Project [*]
          'UnresolvedRelation [testData], None
      
      == Analyzed Logical Plan ==
      Project [key#0]
       Subquery t
        Sort [key#0 ASC], true
         Project [key#0,value#1]
          Subquery testData
           LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Optimized Logical Plan ==
      Sort [key#0 ASC], true
       Project [key#0]
        LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Physical Plan ==
      Sort [key#0 ASC], true
       Exchange (RangePartitioning [key#0 ASC], 5), []
        Project [key#0]
         PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
      ```
      
      With this rule, we first perform column pruning on the table and then sort.
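      The rewrite can be sketched on a toy plan representation (a hypothetical Python stand-in for the Catalyst rule; the names are illustrative):

      ```python
      from collections import namedtuple

      Project = namedtuple("Project", ["columns", "child"])
      Sort = namedtuple("Sort", ["keys", "child"])

      def push_project_below_sort(plan):
          """Rewrite Project(Sort(child)) into Sort(Project(child)) when the
          projection keeps every sort key, so that column pruning happens
          before the (more expensive) sort."""
          if isinstance(plan, Project) and isinstance(plan.child, Sort):
              sort = plan.child
              if set(sort.keys) <= set(plan.columns):
                  return Sort(sort.keys, Project(plan.columns, sort.child))
          return plan
      ```

      For the query above, `Project(["key"], Sort(["key"], testData))` becomes a `Sort` over a pruned `Project`, matching the optimized plan shown.
      
      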
      
      Author: scwf <wangfei1@huawei.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #5838 from scwf/pruning and squashes the following commits:
      
      b00d833 [scwf] address michael's comment
      e230155 [scwf] fix tests failure
      b09b895 [scwf] improve column pruning
      59250fe5
    • Burak Yavuz's avatar
      [SPARK-7382] [MLLIB] Feature Parity in PySpark for ml.classification · df2fb130
      Burak Yavuz authored
      The missing pieces in ml.classification for Python!
      
      cc mengxr
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6106 from brkyvz/ml-class and squashes the following commits:
      
      dd78237 [Burak Yavuz] fix style
      1048e29 [Burak Yavuz] ready for PR
      df2fb130
    • leahmcguire's avatar
      [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that... · 61e05fc5
      leahmcguire authored
      [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that both training and predict features have values of 0 or 1
      
      Author: leahmcguire <lmcguire@salesforce.com>
      
      Closes #6073 from leahmcguire/binaryCheckNB and squashes the following commits:
      
      b8442c2 [leahmcguire] changed to if else for value checks
      911bf83 [leahmcguire] undid reformat
      4eedf1e [leahmcguire] moved bernoulli check
      9ee9e84 [leahmcguire] fixed style error
      3f3b32c [leahmcguire] fixed zero one check so only called in combiner
      831fd27 [leahmcguire] got test working
      f44bb3c [leahmcguire] removed changes from CV branch
      67253f0 [leahmcguire] added check to bernoulli to ensure feature values are zero or one
      f191c71 [leahmcguire] fixed name
      58d060b [leahmcguire] changed param name and test according to comments
      04f0d3c [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access
      61e05fc5
    • Burak Yavuz's avatar
      [SPARK-7593] [ML] Python Api for ml.feature.Bucketizer · 5db18ba6
      Burak Yavuz authored
      Added `ml.feature.Bucketizer` to PySpark.
      
      cc mengxr
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6124 from brkyvz/ml-bucket and squashes the following commits:
      
      05285be [Burak Yavuz] added sphinx doc
      6abb6ed [Burak Yavuz] added support for Bucketizer
      5db18ba6
    • Tim Ellison's avatar
      [MINOR] [CORE] Accept alternative mesos unsatisfied link error in test. · 51030b8a
      Tim Ellison authored
      The IBM JVM reports a failed library load with a slightly different error message from Oracle's JVM. Update the test case to allow for either form.
      
      Author: Tim Ellison <tellison@users.noreply.github.com>
      Author: Tim Ellison <t.p.ellison@gmail.com>
      
      Closes #6119 from tellison/LibraryLoading and squashes the following commits:
      
      2c5cd4e [Tim Ellison] Reduce assertion to check for the mesos library name
      f48c194 [Tim Ellison] Split long line
      b1079d7 [Tim Ellison] [MINOR] [CORE] Accept alternative mesos unsatisfied link error in test.
      51030b8a
    • Tim Ellison's avatar
      [MINOR] Enhance SizeEstimator to detect IBM compressed refs and s390 … · 3cd9ad24
      Tim Ellison authored
      …arch.
      
       - zSeries 64-bit Java reports its architecture as s390x, so enhance the 64-bit check to accommodate that value.
      
       - SizeEstimator can detect whether IBM Java is using compressed object pointers from the "java.vm.info" property, so it does a better job than failing on the HotSpot MBean and guessing.
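      The two checks can be sketched as pure functions over the relevant system-property values (illustrative; the real code reads `os.arch` and `java.vm.info` via `System.getProperty`):

      ```python
      KNOWN_64BIT_ARCHS = ("amd64", "x86_64", "aarch64", "ppc64", "s390x")

      def is_64bit(os_arch):
          # zSeries 64-bit Java reports "s390x", which a plain
          # "ends with '64'" test would miss, so list it explicitly.
          return os_arch in KNOWN_64BIT_ARCHS or os_arch.endswith("64")

      def ibm_uses_compressed_refs(java_vm_info):
          # IBM Java advertises compressed object pointers in java.vm.info,
          # so there is no need to fall back to the HotSpot MBean and guess.
          return "compressed references" in java_vm_info.lower()
      ```
      
      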
      
      Author: Tim Ellison <t.p.ellison@gmail.com>
      
      Closes #6085 from tellison/SizeEstimator and squashes the following commits:
      
      1b6ff6a [Tim Ellison] Merge branch 'master' of https://github.com/apache/spark into SizeEstimator
      0968989 [Tim Ellison] [MINOR] Enhance SizeEstimator to detect IBM compressed refs and s390 arch.
      3cd9ad24
    • Tim Ellison's avatar
      [MINOR] Avoid passing the PermGenSize option to IBM JVMs. · e676fc0c
      Tim Ellison authored
      IBM's Java VM doesn't have the concept of a permgen, so this option shouldn't be passed when the vendor property shows it is an IBM JDK.
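      A sketch of the vendor check (a hypothetical helper; the actual change lives in Spark's launch scripts and code):

      ```python
      def filter_jvm_options(java_vendor, opts):
          """Drop -XX:MaxPermSize for IBM JVMs, which have no permgen and
          should not be passed the flag."""
          if "IBM" in java_vendor:
              return [o for o in opts if not o.startswith("-XX:MaxPermSize")]
          return list(opts)
      ```
      
      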
      
      Author: Tim Ellison <t.p.ellison@gmail.com>
      Author: Tim Ellison <tellison@users.noreply.github.com>
      
      Closes #6055 from tellison/MaxPermSize and squashes the following commits:
      
      3a0fb66 [Tim Ellison] Convert tabs back to spaces
      6ad4266 [Tim Ellison] Remove unnecessary else clauses to reduce nesting.
      d27174b [Tim Ellison] Merge branch 'master' of https://github.com/apache/spark into MaxPermSize
      42a8c3f [Tim Ellison] [MINOR] Avoid passing the PermGenSize option to IBM JVMs.
      e676fc0c
    • Wenchen Fan's avatar
      [SPARK-7551][DataFrame] support backticks for DataFrame attribute resolution · 213a6f30
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6074 from cloud-fan/7551 and squashes the following commits:
      
      e6f579e [Wenchen Fan] allow space
      2b86699 [Wenchen Fan] handle blank
      e218d99 [Wenchen Fan] address comments
      54c4209 [Wenchen Fan] fix 7551
      213a6f30
    • Cheng Lian's avatar
      [SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelation · 7ff16e8a
      Cheng Lian authored
      This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:
      
      1. Partition discovery code has been factored out to `FSBasedRelation`
      1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
      1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
      1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore
      
         After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6090 from liancheng/parquet-migration and squashes the following commits:
      
      6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
      bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
      f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
      261d8c1 [Cheng Lian] Minor bug fix and more tests
      db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
      7ff16e8a
    • zsxwing's avatar
      [SPARK-7589] [STREAMING] [WEBUI] Make "Input Rate" in the Streaming page... · bec938f7
      zsxwing authored
      [SPARK-7589] [STREAMING] [WEBUI] Make "Input Rate" in the Streaming page consistent with other pages
      
      This PR makes "Input Rate" in the Streaming page consistent with Job and Stage pages.
      
      ![screen shot 2015-05-12 at 5 03 35 pm](https://cloud.githubusercontent.com/assets/1000778/7601444/f943f8ac-f8ca-11e4-8280-a715d814f434.png)
      ![screen shot 2015-05-12 at 5 07 25 pm](https://cloud.githubusercontent.com/assets/1000778/7601445/f9571c0c-f8ca-11e4-9b12-9317cb55c002.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6102 from zsxwing/SPARK-7589 and squashes the following commits:
      
      2745225 [zsxwing] Make "Input Rate" in the Streaming page consistent with other pages
      bec938f7
    • Cheng Hao's avatar
      [SPARK-6734] [SQL] Add UDTF.close support in Generate · 0da254fb
      Cheng Hao authored
      Some third-party UDTF extensions generate additional rows in the `GenericUDTF.close()` method, which is supported and documented by Hive:
      https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
      However, Spark SQL ignores `GenericUDTF.close()`, which causes bugs when porting jobs from Hive to Spark SQL.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5383 from chenghao-intel/udtf_close and squashes the following commits:
      
      98b4e4b [Cheng Hao] Support UDTF.close
      0da254fb
    • Cheng Lian's avatar
      [MINOR] [SQL] Removes debugging println · aa6ba3f2
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6123 from liancheng/remove-println and squashes the following commits:
      
      03356b6 [Cheng Lian] Removes debugging println
      aa6ba3f2
    • Yin Huai's avatar
      [SQL] In InsertIntoFSBasedRelation.insert, log cause before abort job/task. · b061bd51
      Yin Huai authored
      We need to add a log entry before calling `abortTask`/`abortJob`. Otherwise, an exception from `abortTask`/`abortJob` will shadow the real cause.
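      The pattern can be sketched generically (illustrative Python; the actual code is in `InsertIntoFSBasedRelation`):

      ```python
      import logging

      logger = logging.getLogger(__name__)

      def run_task(write, abort_task):
          """Log the original failure *before* calling abort, so that an
          exception raised by abort cannot shadow the real cause."""
          try:
              write()
          except Exception:
              logger.exception("Aborting task due to exception")
              abort_task()
              raise
      ```
      
      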
      
      cc liancheng
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6105 from yhuai/logCause and squashes the following commits:
      
      8dfe0d8 [Yin Huai] Log cause.
      b061bd51
    • Cheng Lian's avatar
      [SPARK-7599] [SQL] Don't restrict customized output committers to be... · 10c546e9
      Cheng Lian authored
      [SPARK-7599] [SQL] Don't restrict customized output committers to be subclasses of FileOutputCommitter
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6118 from liancheng/spark-7599 and squashes the following commits:
      
      31e1bd6 [Cheng Lian] Don't restrict customized output committers to be subclasses of FileOutputCommitter
      10c546e9
    • Masayoshi TSUZUKI's avatar
      [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path · 50c72708
      Masayoshi TSUZUKI authored
      Escape spaces in the arguments.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #5447 from tsudukim/feature/SPARK-6568-2 and squashes the following commits:
      
      3f9a188 [Masayoshi TSUZUKI] modified some errors.
      ed46047 [Masayoshi TSUZUKI] avoid scalastyle errors.
      1784239 [Masayoshi TSUZUKI] removed Utils.formatPath.
      e03f289 [Masayoshi TSUZUKI] removed testWindows from Utils.resolveURI and Utils.resolveURIs. replaced SystemUtils.IS_OS_WINDOWS to Utils.isWindows. removed Utils.formatPath from PythonRunner.scala.
      84c33d0 [Masayoshi TSUZUKI] - use resolveURI in nonLocalPaths - run tests for Windows path only on Windows
      016128d [Masayoshi TSUZUKI] fixed to use File.toURI()
      2c62e3b [Masayoshi TSUZUKI] Merge pull request #1 from sarutak/SPARK-6568-2
      7019a8a [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6568-2
      45946ee [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6568-2
      10f1c73 [Kousuke Saruta] Added a comment
      93c3c40 [Kousuke Saruta] Merge branch 'classpath-handling-fix' of github.com:sarutak/spark into SPARK-6568-2
      649da82 [Kousuke Saruta] Fix classpath handling
      c7ba6a7 [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
      50c72708
    • linweizhong's avatar
      [SPARK-7526] [SPARKR] Specify ip of RBackend, MonitorServer and RRDD Socket server · 98195c30
      linweizhong authored
      These R processes are only used to communicate with the JVM process on the same host, so binding to localhost is more reasonable than a wildcard IP.
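      The change amounts to binding the server sockets to the loopback interface instead of all interfaces; in Python terms (illustrative):

      ```python
      import socket

      def open_local_server():
          """Bind to loopback only: the R worker and the JVM always run on
          the same host, so there is no reason to listen on a wildcard IP.
          Port 0 lets the OS pick an ephemeral port."""
          server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          server.bind(("localhost", 0))
          server.listen(1)
          return server
      ```
      
      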
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #6053 from Sephiroth-Lin/spark-7526 and squashes the following commits:
      
      5303af7 [linweizhong] bind to localhost rather than wildcard ip
      98195c30
    • Sun Rui's avatar
      [SPARK-7482] [SPARKR] Rename some DataFrame API methods in SparkR to match... · df9b94a5
      Sun Rui authored
      [SPARK-7482] [SPARKR] Rename some DataFrame API methods in SparkR to match their counterparts in Scala.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #6007 from sun-rui/SPARK-7482 and squashes the following commits:
      
      5c5cf5e [Sun Rui] Implement alias loadDF() as a new function.
      3a30c10 [Sun Rui] Rename load()/save() to read.df()/write.df(). Also add loadDF()/saveDF() as aliases.
      9f569d6 [Sun Rui] [SPARK-7482][SparkR] Rename some DataFrame API methods in SparkR to match their counterparts in Scala.
      df9b94a5
    • Santiago M. Mola's avatar
      [SPARK-7566][SQL] Add type to HiveContext.analyzer · 208b9022
      Santiago M. Mola authored
      This makes HiveContext.analyzer overrideable.
      
      Author: Santiago M. Mola <santi@mola.io>
      
      Closes #6086 from smola/patch-3 and squashes the following commits:
      
      8ece136 [Santiago M. Mola] [SPARK-7566][SQL] Add type to HiveContext.analyzer
  2. May 12, 2015
    • [SPARK-7321][SQL] Add Column expression for conditional statements (when/otherwise) · 97dee313
      Reynold Xin authored
      This builds on https://github.com/apache/spark/pull/5932 and should close https://github.com/apache/spark/pull/5932 as well.
      
      As an example:
      ```python
      df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: kaka1992 <kaka_1992@163.com>
      
      Closes #6072 from rxin/when-expr and squashes the following commits:
      
      8f49201 [Reynold Xin] Throw exception if otherwise is applied twice.
      0455eda [Reynold Xin] Reset run-tests.
      bfb9d9f [Reynold Xin] Updated documentation and test cases.
      762f6a5 [Reynold Xin] Merge pull request #5932 from kaka1992/IFCASE
      95724c6 [kaka1992] Update
      8218d0a [kaka1992] Update
      801009e [kaka1992] Update
      76d6346 [kaka1992] [SPARK-7321][SQL] Add Column expression for conditional statements (if, case)
    • [SPARK-7588] Document all SQL/DataFrame public methods with @since tag · 8fd55358
      Reynold Xin authored
      This pull request adds a `@since` tag to all public methods/classes in SQL/DataFrame to indicate the version in which each method/class was first added.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6101 from rxin/tbc and squashes the following commits:
      
      ed55e11 [Reynold Xin] Add since version to all DataFrame methods.
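      The general idea, recording alongside the docs the version in which an API was added, can be sketched in Python with a decorator (a sketch only; the actual change adds `@since` tags to Scaladoc):

```python
def since(version):
    """Decorator recording the version in which an API was added and
    appending a note to its docstring. A sketch of the idea behind the
    @since tag; not Spark's actual mechanism."""
    def decorator(func):
        func.__doc__ = (func.__doc__ or "") + f"\n\n.. versionadded:: {version}"
        func._since = version
        return func
    return decorator

@since("1.3.0")
def select(*cols):
    """Projects a set of expressions and returns the selected columns."""
    return cols
```

      Doc tooling can then surface the recorded version next to each method, which is what the tags enable in the generated SQL/DataFrame docs.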
    • [SPARK-7592] Always set resolution to "Fixed" in PR merge script. · 1b9e434b
      Patrick Wendell authored
      The issue is that the behavior of the ASF JIRA silently
      changed. Now when the "Resolve Issue" transition occurs,
      the default resolution is "Pending Closed". We used to
      count on the default behavior being to set the
      resolution as "Fixed".
      
      The solution is to explicitly set the resolution as "Fixed" and not
      count on default behavior.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #6103 from pwendell/merge-script-fix and squashes the following commits:
      
      dcc16a6 [Patrick Wendell] Always set resolution to "Fixed" in PR merge script.
    • [HOTFIX] Use the old Job API to support old Hadoop versions · 247b7034
      zsxwing authored
      #5526 uses `Job.getInstance`, which does not exist in old Hadoop versions, so replace it with `new Job`.
      
      cc liancheng
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6095 from zsxwing/hotfix and squashes the following commits:
      
      b0c2049 [zsxwing] Use the old Job API to support old Hadoop versions
    • [SPARK-7572] [MLLIB] do not import Param/Params under pyspark.ml · 77f64c73
      Xiangrui Meng authored
      Remove `Param` and `Params` from `pyspark.ml` and add a section in the doc. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6094 from mengxr/SPARK-7572 and squashes the following commits:
      
      022abd6 [Xiangrui Meng] do not import Param/Params under spark.ml
    • [SPARK-7554] [STREAMING] Throw exception when an active/stopped... · 23f7d66d
      Tathagata Das authored
      [SPARK-7554] [STREAMING] Throw exception when an active/stopped StreamingContext is used to create DStreams and output operations
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6099 from tdas/SPARK-7554 and squashes the following commits:
      
      2cd4158 [Tathagata Das] Throw exceptions on attempts to add stuff to active and stopped contexts.
    • [SPARK-7528] [MLLIB] make RankingMetrics Java-friendly · 2713bc65
      Xiangrui Meng authored
      `RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6098 from mengxr/SPARK-7528 and squashes the following commits:
      
      e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly
    • [SPARK-7553] [STREAMING] Added methods to maintain a singleton StreamingContext · 00e7b09a
      Tathagata Das authored
      In a REPL/notebook environment, it's very easy to lose a reference to a StreamingContext by overwriting the variable name. So if you happen to execute the following commands
      ```
      val ssc = new StreamingContext(...) // cmd 1
      ssc.start() // cmd 2
      ...
      val ssc = new StreamingContext(...) // accidentally run cmd 1 again
      ```
      The value of ssc is overwritten. Now you can neither start the new context (as only one context can be started) nor stop the previous context (as its reference is lost).
      Hence it's best to maintain a singleton reference to the active context, so that we never lose the reference to it.
      Since this problem mostly occurs in REPL environments, it's best to add this as Experimental support in the Scala API only, so that it can be used in Scala REPLs and notebooks.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6070 from tdas/SPARK-7553 and squashes the following commits:
      
      731c9a1 [Tathagata Das] Fixed style
      a797171 [Tathagata Das] Added more unit tests
      19fc70b [Tathagata Das] Added :: Experimental :: in docs
      64706c9 [Tathagata Das] Fixed test
      634db5d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7553
      3884a25 [Tathagata Das] Fixing test bug
      d37a846 [Tathagata Das] Added getActive and getActiveOrCreate
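      The singleton bookkeeping described above can be sketched in Python (a minimal model with a hypothetical `Context` class; the real methods live on StreamingContext in the Scala API):

```python
import threading

class Context:
    """A stand-in for StreamingContext, tracking a single active instance."""
    _active = None            # the singleton active context, if any
    _lock = threading.Lock()  # guards _active across threads

    def __init__(self):
        self.started = False
        self.stopped = False

    def start(self):
        with Context._lock:
            if Context._active is not None:
                raise RuntimeError("Only one context may be active at a time")
            self.started = True
            Context._active = self

    def stop(self):
        with Context._lock:
            self.stopped = True
            if Context._active is self:
                Context._active = None

    @classmethod
    def get_active(cls):
        """Return the currently active context, or None."""
        with cls._lock:
            return cls._active

    @classmethod
    def get_active_or_create(cls, creating_func):
        """Reuse the active context if one exists, else create a new one
        (mirroring the getActiveOrCreate idea; creation does not start it)."""
        with cls._lock:
            return cls._active if cls._active is not None else creating_func()
```

      With this shape, accidentally re-running the creation line no longer strands a started context: `get_active` recovers the reference, and starting a second context fails loudly instead of silently shadowing the first.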
    • [SPARK-7573] [ML] OneVsRest cleanups · 96c4846d
      Joseph K. Bradley authored
      Minor cleanups discussed with [~mengxr]:
      * move OneVsRest from reduction to classification sub-package
      * make model constructor private
      
      Some doc cleanups too
      
      CC: harsha2010  Could you please verify this looks OK?  Thanks!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits:
      
      4ecd48d [Joseph K. Bradley] org imports
      430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification.  small java doc style fixes
      9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest.  Made model constructor private to ml package.
    • [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer · f0c1bc34
      Joseph K. Bradley authored
      Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.
      
      I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:
      
      d5d213f [Joseph K. Bradley] small fix
      dd6e91a [Joseph K. Bradley] fixes from code review of user guide
      33c3ff9 [Joseph K. Bradley] small fix
      bc6058c [Joseph K. Bradley] fix link
      361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide
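      The hashing trick that HashingTF relies on can be sketched in plain Python (a sketch only: Spark's HashingTF uses its own hash function and a much larger default feature dimension, and Tokenizer's real behavior is richer than a lowercase split):

```python
def tokenize(text):
    """Lowercase whitespace tokenizer, loosely in the spirit of Tokenizer."""
    return text.lower().split()

def hashing_tf(tokens, num_features=16):
    """Term frequencies via the hashing trick: each token is counted in
    bucket hash(token) % num_features, so the output vector has a fixed
    size regardless of vocabulary."""
    vec = [0] * num_features
    for token in tokens:
        # Python's hash() can be negative; % keeps the index in range.
        vec[hash(token) % num_features] += 1
    return vec

features = hashing_tf(tokenize("I wish Java could use case classes"))
```

      Note that Python randomizes string hashes per process, so the bucket indices vary between runs; the total count of hashed terms does not.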
    • [SPARK-7496] [MLLIB] Update Programming guide with Online LDA · 1d703660
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7496
      
      Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6046 from hhbyyh/ldaDocument and squashes the following commits:
      
      4b6fbfa [Yuhao Yang] add online paper and some comparison
      fd4c983 [Yuhao Yang] update lda document for optimizers
    • [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay",... · 1422e79e
      zsxwing authored
      [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
      
      Screenshots:
      ![screen shot 2015-05-06 at 2 29 03 pm](https://cloud.githubusercontent.com/assets/1000778/7504129/9c57f710-f3fc-11e4-9c6e-1b79c17c546d.png)
      
      ![screen shot 2015-05-06 at 2 24 35 pm](https://cloud.githubusercontent.com/assets/1000778/7504140/b63bb216-f3fc-11e4-83a5-6dfc6481d192.png)
      
      tdas as we discussed offline
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5952 from zsxwing/SPARK-7406 and squashes the following commits:
      
      2b004ea [zsxwing] Merge branch 'master' into SPARK-7406
      e9eb506 [zsxwing] Update tooltip contents
      2215b2a [zsxwing] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
    • [SPARK-7571] [MLLIB] rename Math to math · a4874b0d
      Xiangrui Meng authored
      `scala.Math` has been deprecated since 2.8. This PR only touches `Math` usages in MLlib. dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6092 from mengxr/SPARK-7571 and squashes the following commits:
      
      fe8f8d3 [Xiangrui Meng] Math -> math
    • [SPARK-7484][SQL] Support jdbc connection properties · 455551d1
      Venkata Ramana Gollamudi authored
      A few JDBC drivers, such as SybaseIQ, support passing the username and password only through connection properties, so the same needs to be supported for
      SQLContext.jdbc, dataframe.createJDBCTable and dataframe.insertIntoJDBC.
      Added as default arguments or overloaded functions to preserve backward compatibility.
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #6009 from gvramana/add_jdbc_conn_properties and squashes the following commits:
      
      396a0d0 [Venkata Ramana Gollamudi] fixed comments
      d66dd8c [Venkata Ramana Gollamudi] fixed comments
      1b8cd8c [Venkata Ramana Gollamudi] Support jdbc connection properties
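      The backward-compatible shape of the change, an optional properties argument with a default, can be sketched in Python (hypothetical function name and keys; not Spark's actual signatures):

```python
def create_jdbc_table(url, table, properties=None):
    """Build the effective connection settings for a JDBC write.

    Drivers such as SybaseIQ accept user/password only as connection
    properties, not in the URL, hence the extra optional parameter.
    Old two-argument call sites keep working because of the default."""
    settings = {"url": url, "dbtable": table}
    settings.update(properties or {})
    return settings

# Old call sites still work...
legacy = create_jdbc_table("jdbc:sybase:Tds:host:2638", "people")
# ...while new ones can pass credentials as connection properties.
modern = create_jdbc_table("jdbc:sybase:Tds:host:2638", "people",
                           {"user": "spark", "password": "secret"})
```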