  1. May 22, 2015
    • [SPARK-6743] [SQL] Fix empty projections of cached data · 3b68cb04
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6165 from marmbrus/wrongColumn and squashes the following commits:
      
      4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
      aad7eab [Michael Armbrust] rxins comments
      f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
  2. May 19, 2015
    • [SPARK-7681] [MLLIB] remove mima excludes for 1.3 · 6845cb2f
      Xiangrui Meng authored
      These excludes are unnecessary for 1.3 because the changes were made in 1.4.x.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6254 from mengxr/SPARK-7681-mima and squashes the following commits:
      
      7f0cea0 [Xiangrui Meng] remove mima excludes for 1.3
  3. May 18, 2015
    • [SPARK-7681] [MLLIB] Add SparseVector support for gemv · d03638cc
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7681
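      A small usage sketch of what this change enables (assuming the generalized `Matrix.multiply` overload described in the commits below; the result is computed by hand):
      
      ```scala
      import org.apache.spark.mllib.linalg.{Matrices, Vectors}
      
      // 2x3 matrix in column-major order: columns (1,0), (0,2), (3,0)
      val A = Matrices.dense(2, 3, Array(1.0, 0.0, 0.0, 2.0, 3.0, 0.0))
      val x = Vectors.sparse(3, Seq((0, 1.0), (2, 4.0)))
      // Matrix.multiply now also accepts sparse vectors: y = A * x = (13.0, 0.0)
      val y = A.multiply(x)
      ```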
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
      
      ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
      b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
      57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
      458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
      054f05d [Liang-Chi Hsieh] Fix scala style.
      410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
      4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
      5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
      c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
    • [SPARK-6888] [SQL] Make the jdbc driver handling user-definable · e1ac2a95
      Rene Treffer authored
      Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
      and allow developers to change the dialects on the fly (for new JDBCRDDs only).
      
      Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java.
      The status quo is that the RDD will fail to load.
      This patch makes it possible to override the type mapping, to read e.g.
      64-bit numbers as strings and handle them afterwards in software.
      
      JDBCSuite has an example that maps all types to String, which should always
      work (at the cost of extra code afterwards).
      
      As a side effect it should now be possible to develop simple dialects
      out-of-tree and even with spark-shell.
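      
      As a rough sketch (assuming the `JdbcDialect`/`JdbcDialects` API introduced here; the dialect itself is hypothetical), an out-of-tree dialect that reads unsigned 64-bit columns as strings could look like:
      
      ```scala
      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
      import org.apache.spark.sql.types._
      
      // Hypothetical dialect: map BIGINT UNSIGNED columns to StringType so values
      // above Long.MaxValue survive loading and can be post-processed in software.
      object UnsignedBigIntDialect extends JdbcDialect {
        override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
      
        override def getCatalystType(
            sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
          if (typeName.equalsIgnoreCase("BIGINT UNSIGNED")) Some(StringType) else None
      }
      
      // Register on the fly (e.g. from spark-shell); applies to new JDBC relations only.
      JdbcDialects.registerDialect(UnsignedBigIntDialect)
      ```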
      
      Author: Rene Treffer <treffer@measite.de>
      
      Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits:
      
      3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report
      fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
  4. May 13, 2015
    • [SPARK-7081] Faster sort-based shuffle path using binary processing cache-aware sort · 73bed408
      Josh Rosen authored
      This patch introduces a new shuffle manager that enhances the existing sort-based shuffle with a new cache-friendly sort algorithm that operates directly on binary data. The goals of this patch are to lower memory usage and Java object overheads during shuffle and to speed up sorting. It also lays groundwork for follow-up patches that will enable end-to-end processing of serialized records.
      
      The new shuffle manager, `UnsafeShuffleManager`, can be enabled by setting `spark.shuffle.manager=tungsten-sort` in SparkConf.
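      
      For example (a minimal sketch; the app name is illustrative):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      // Opt in to the new shuffle path; shuffles that don't meet the conditions
      // below fall back to the existing sort-based path automatically.
      val conf = new SparkConf()
        .setAppName("tungsten-sort-demo")
        .set("spark.shuffle.manager", "tungsten-sort")
      val sc = new SparkContext(conf)
      ```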
      
      The new shuffle manager uses directly-managed memory to implement several performance optimizations for certain types of shuffles. In cases where the new performance optimizations cannot be applied, the new shuffle manager delegates to SortShuffleManager to handle those shuffles.
      
      UnsafeShuffleManager's optimizations will apply when _all_ of the following conditions hold:
      
       - The shuffle dependency specifies no aggregation or output ordering.
       - The shuffle serializer supports relocation of serialized values (this is currently supported
         by KryoSerializer and Spark SQL's custom serializers).
       - The shuffle produces fewer than 16777216 (2^24) output partitions.
       - No individual record is larger than 128 MB when serialized.
      
      In addition, extra spill-merging optimizations are automatically applied when the shuffle compression codec supports concatenation of serialized streams. This is currently supported by Spark's LZF compression codec.
      
      At a high level, UnsafeShuffleManager's design is similar to Spark's existing SortShuffleManager.  In sort-based shuffle, incoming records are sorted according to their target partition ids, then written to a single map output file. Reducers fetch contiguous regions of this file in order to read their portion of the map output. In cases where the map output data is too large to fit in memory, sorted subsets of the output are spilled to disk and those on-disk files are merged to produce the final output file.
      
      UnsafeShuffleManager optimizes this process in several ways:
      
       - Its sort operates on serialized binary data rather than Java objects, which reduces memory consumption and GC overheads. This optimization requires the record serializer to have certain properties to allow serialized records to be re-ordered without requiring deserialization.  See SPARK-4550, where this optimization was first proposed and implemented, for more details.
      
       - It uses a specialized cache-efficient sorter (UnsafeShuffleExternalSorter) that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, this fits more of the array into cache (see the sketch after this list).
      
       - The spill merging procedure operates on blocks of serialized records that belong to the same partition and does not need to deserialize records during the merge.
      
       - When the spill compression codec supports concatenation of compressed data, the spill merge simply concatenates the serialized and compressed spill partitions to produce the final output partition.  This allows efficient data copying methods, like NIO's `transferTo`, to be used and avoids the need to allocate decompression or copying buffers during the merge.
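      
      A sketch of the pointer-packing idea (field widths follow the constraints above: 24 bits of partition id leave 40 bits for a compressed record address; the object and method names are illustrative):
      
      ```scala
      // Pack a partition id and a compressed record address into a single 8-byte
      // word, so the sort only has to move longs around.
      // Layout (high to low bits): [24-bit partition id | 40-bit compressed address]
      object PackedPointerSketch {
        private val AddressBits = 40
        private val AddressMask = (1L << AddressBits) - 1
      
        def pack(partitionId: Int, compressedAddress: Long): Long =
          (partitionId.toLong << AddressBits) | (compressedAddress & AddressMask)
      
        def partitionId(packed: Long): Int = (packed >>> AddressBits).toInt  // < 2^24
        def address(packed: Long): Long = packed & AddressMask
      }
      ```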
      
      The shuffle read path is unchanged.
      
      This patch is similar to [SPARK-4550](http://issues.apache.org/jira/browse/SPARK-4550) / #4450 but uses a slightly different implementation. The `unsafe`-based implementation featured in this patch lays the groundwork for followup patches that will enable sorting to operate on serialized data pages that will be prepared by Spark SQL's new `unsafe` operators (such as the new aggregation operator introduced in #5725).
      
      ### Future work
      
      There are several tasks that build upon this patch, which will be left to future work:
      
      - [SPARK-7271](https://issues.apache.org/jira/browse/SPARK-7271) Redesign / extend the shuffle interfaces to accept binary data as input. The goal here is to let us bypass serialization steps in cases where the sort input is produced by an operator that operates directly on binary data.
      - Extension / redesign of the `Serializer` API. We can add new methods which allow serializers to determine the size requirements for serializing objects and for serializing objects directly to a specified memory address (similar to how `UnsafeRowConverter` works in Spark SQL).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5868 from JoshRosen/unsafe-sort and squashes the following commits:
      
      ef0a86e [Josh Rosen] Fix scalastyle errors
      7610f2f [Josh Rosen] Add tests for proper cleanup of shuffle data.
      d494ffe [Josh Rosen] Fix deserialization of JavaSerializer instances.
      52a9981 [Josh Rosen] Fix some bugs in the address packing code.
      51812a7 [Josh Rosen] Change shuffle manager sort name to tungsten-sort
      4023fa4 [Josh Rosen] Add @Private annotation to some Java classes.
      de40b9d [Josh Rosen] More comments to try to explain metrics code
      df07699 [Josh Rosen] Attempt to clarify confusing metrics update code
      5e189c6 [Josh Rosen] Track time spend closing / flushing files; split TimeTrackingOutputStream into separate file.
      d5779c6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      c2ce78e [Josh Rosen] Fix a missed usage of MAX_PARTITION_ID
      e3b8855 [Josh Rosen] Cleanup in UnsafeShuffleWriter
      4a2c785 [Josh Rosen] rename 'sort buffer' to 'pointer array'
      6276168 [Josh Rosen] Remove ability to disable spilling in UnsafeShuffleExternalSorter.
      57312c9 [Josh Rosen] Clarify fileBufferSize units
      2d4e4f4 [Josh Rosen] Address some minor comments in UnsafeShuffleExternalSorter.
      fdcac08 [Josh Rosen] Guard against overflow when expanding sort buffer.
      85da63f [Josh Rosen] Cleanup in UnsafeShuffleSorterIterator.
      0ad34da [Josh Rosen] Fix off-by-one in nextInt() call
      56781a1 [Josh Rosen] Rename UnsafeShuffleSorter to UnsafeShuffleInMemorySorter
      e995d1a [Josh Rosen] Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
      e58a6b4 [Josh Rosen] Add more tests for PackedRecordPointer encoding.
      4f0b770 [Josh Rosen] Attempt to implement proper shuffle write metrics.
      d4e6d89 [Josh Rosen] Update to bit shifting constants
      69d5899 [Josh Rosen] Remove some unnecessary override vals
      8531286 [Josh Rosen] Add tests that automatically trigger spills.
      7c953f9 [Josh Rosen] Add test that covers UnsafeShuffleSortDataFormat.swap().
      e1855e5 [Josh Rosen] Fix a handful of misc. IntelliJ inspections
      39434f9 [Josh Rosen] Avoid integer multiplication overflow in getMemoryUsage (thanks FindBugs!)
      1e3ad52 [Josh Rosen] Delete unused ByteBufferOutputStream class.
      ea4f85f [Josh Rosen] Roll back an unnecessary change in Spillable.
      ae538dc [Josh Rosen] Document UnsafeShuffleManager.
      ec6d626 [Josh Rosen] Add notes on maximum # of supported shuffle partitions.
      0d4d199 [Josh Rosen] Bump up shuffle.memoryFraction to make tests pass.
      b3b1924 [Josh Rosen] Properly implement close() and flush() in DummySerializerInstance.
      1ef56c7 [Josh Rosen] Revise compression codec support in merger; test cross product of configurations.
      b57c17f [Josh Rosen] Disable some overly-verbose logs that rendered DEBUG useless.
      f780fb1 [Josh Rosen] Add test demonstrating which compression codecs support concatenation.
      4a01c45 [Josh Rosen] Remove unnecessary log message
      27b18b0 [Josh Rosen] That for inserting records AT the max record size.
      fcd9a3c [Josh Rosen] Add notes + tests for maximum record / page sizes.
      9d1ee7c [Josh Rosen] Fix MiMa excludes for ShuffleWriter change
      fd4bb9e [Josh Rosen] Use own ByteBufferOutputStream rather than Kryo's
      67d25ba [Josh Rosen] Update Exchange operator's copying logic to account for new shuffle manager
      8f5061a [Josh Rosen] Strengthen assertion to check partitioning
      01afc74 [Josh Rosen] Actually read data in UnsafeShuffleWriterSuite
      1929a74 [Josh Rosen] Update to reflect upstream ShuffleBlockManager -> ShuffleBlockResolver rename.
      e8718dd [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      9b7ebed [Josh Rosen] More defensive programming RE: cleaning up spill files and memory after errors
      7cd013b [Josh Rosen] Begin refactoring to enable proper tests for spilling.
      722849b [Josh Rosen] Add workaround for transferTo() bug in merging code; refactor tests.
      9883e30 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      b95e642 [Josh Rosen] Refactor and document logic that decides when to spill.
      1ce1300 [Josh Rosen] More minor cleanup
      5e8cf75 [Josh Rosen] More minor cleanup
      e67f1ea [Josh Rosen] Remove upper type bound in ShuffleWriter interface.
      cfe0ec4 [Josh Rosen] Address a number of minor review comments:
      8a6fe52 [Josh Rosen] Rename UnsafeShuffleSpillWriter to UnsafeShuffleExternalSorter
      11feeb6 [Josh Rosen] Update TODOs related to shuffle write metrics.
      b674412 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      aaea17b [Josh Rosen] Add comments to UnsafeShuffleSpillWriter.
      4f70141 [Josh Rosen] Fix merging; now passes UnsafeShuffleSuite tests.
      133c8c9 [Josh Rosen] WIP towards testing UnsafeShuffleWriter.
      f480fb2 [Josh Rosen] WIP in mega-refactoring towards shuffle-specific sort.
      57f1ec0 [Josh Rosen] WIP towards packed record pointers for use in optimized shuffle sort.
      69232fd [Josh Rosen] Enable compressible address encoding for off-heap mode.
      7ee918e [Josh Rosen] Re-order imports in tests
      3aeaff7 [Josh Rosen] More refactoring and cleanup; begin cleaning iterator interfaces
      3490512 [Josh Rosen] Misc. cleanup
      f156a8f [Josh Rosen] Hacky metrics integration; refactor some interfaces.
      2776aca [Josh Rosen] First passing test for ExternalSorter.
      5e100b2 [Josh Rosen] Super-messy WIP on external sort
      595923a [Josh Rosen] Remove some unused variables.
      8958584 [Josh Rosen] Fix bug in calculating free space in current page.
      f17fa8f [Josh Rosen] Add missing newline
      c2fca17 [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
      b8a09fe [Josh Rosen] Back out accidental log4j.properties change
      bfc12d3 [Josh Rosen] Add tests for serializer relocation property.
      240864c [Josh Rosen] Remove PrefixComputer and require prefix to be specified as part of insert()
      1433b42 [Josh Rosen] Store record length as int instead of long.
      026b497 [Josh Rosen] Re-use a buffer in UnsafeShuffleWriter
      0748458 [Josh Rosen] Port UnsafeShuffleWriter to Java.
      87e721b [Josh Rosen] Renaming and comments
      d3cc310 [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
      e2d96ca [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
      e267cee [Josh Rosen] Fix compilation of UnsafeSorterSuite
      9c6cf58 [Josh Rosen] Refactor to use DiskBlockObjectWriter.
      253f13e [Josh Rosen] More cleanup
      8e3ec20 [Josh Rosen] Begin code cleanup.
      4d2f5e1 [Josh Rosen] WIP
      3db12de [Josh Rosen] Minor simplification and sanity checks in UnsafeSorter
      767d3ca [Josh Rosen] Fix invalid range in UnsafeSorter.
      e900152 [Josh Rosen] Add test for empty iterator in UnsafeSorter
      57a4ea0 [Josh Rosen] Make initialSize configurable in UnsafeSorter
      abf7bfe [Josh Rosen] Add basic test case.
      81d52c5 [Josh Rosen] WIP on UnsafeSorter
    • [SQL] Move some classes into packages that are more appropriate. · e683182c
      Reynold Xin authored
      JavaTypeInference into catalyst
      types.DateUtils into catalyst
      CacheManager into execution
      DefaultParserDialect into catalyst
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6108 from rxin/sql-rename and squashes the following commits:
      
      3fc9613 [Reynold Xin] Fixed import ordering.
      83d9ff4 [Reynold Xin] Fixed codegen tests.
      e271e86 [Reynold Xin] mima
      f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
    • [SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelation · 7ff16e8a
      Cheng Lian authored
      This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:
      
      1. Partition discovery code has been factored out to `FSBasedRelation`
      1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
      1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
      1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore
      
         After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6090 from liancheng/parquet-migration and squashes the following commits:
      
      6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
      bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
      f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
      261d8c1 [Cheng Lian] Minor bug fix and more tests
      db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
  5. May 12, 2015
    • [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API · 0595b6de
      Cheng Lian authored
      This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path.  Existing data sources like JSON and Parquet can be simplified with this work.
      
      ## New features provided
      
      1. Hive compatible partition discovery
      
         This generalizes the partition discovery strategy used in the Parquet data source in Spark 1.3.0 (see the sketch after this list).
      
      1. Generalized partition pruning optimization
      
         Now partition pruning is handled during physical planning phase.  Specific data sources don't need to worry about this harness anymore.
      
         (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)
      
      1. Insertion with dynamic partitions
      
         When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns.
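      
      Here is the sketch referenced under feature 1: a toy version of Hive-compatible partition discovery (an illustrative helper, not the actual implementation), where directory names of the form `key=value` along a file's path become partition column values.
      
      ```scala
      // e.g. /table/year=2015/month=05/part-00000.parquet
      //   => Map(year -> 2015, month -> 05)
      def parsePartitionSpec(path: String): Map[String, String] =
        path.split("/").iterator
          .filter(_.contains("="))
          .map { segment =>
            val Array(key, value) = segment.split("=", 2)
            key -> value
          }
          .toMap
      ```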
      
      ## New structures provided
      
      ### Developer API
      
      1. `FSBasedRelation`
      
         Base abstract class for file system based data sources.
      
      1. `OutputWriter`
      
         Base abstract class for output row writers, responsible for writing a single row object.
      
      1. `FSBasedRelationProvider`
      
         A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.
      
      ### User API
      
      New overloaded versions of
      
      1. `DataFrame.save()`
      1. `DataFrame.saveAsTable()`
      1. `SQLContext.load()`
      
      are provided to allow users to save/load DataFrames with user defined dynamic partition columns.
      
      ### Spark SQL query planning
      
      1. `InsertIntoFSBasedRelation`
      
         Used to implement write path for `FSBasedRelation`s.
      
      1. New rules for `FSBasedRelation` in `DataSourceStrategy`
      
         These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning.
      
      ## TODO
      
      - [ ] Use scratch directories when overwriting a table with data selected from itself.
      
            Currently, this is not supported, because the table being overwritten is always deleted before writing any data to it.
      
      - [ ] When inserting with dynamic partition columns, use external sorter to group the data first.
      
            This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory consuming.  One issue is that this approach breaks the row distribution in the original DataFrame.  However, we didn't promise to preserve data distribution when writing a DataFrame.
      
      - [x] More tests.  Specifically, test cases for
      
            - [x] Self-join
            - [x] Loading partitioned relations with a subset of partition columns stored in data files.
            - [x] `SQLContext.load()` with user defined dynamic partition columns.
      
      ## Parquet data source migration
      
      Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR needs to be made after this one is merged.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5526 from liancheng/partitioning-support and squashes the following commits:
      
      5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
      1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
      43ba50e [Cheng Lian] Avoids serializing generated projection code
      edf49e7 [Cheng Lian] Removed commented stale code block
      348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
      ad4d4de [Cheng Lian] Enables HDFS style globbing
      8d12e69 [Cheng Lian] Fixes compilation error
      c71ac6c [Cheng Lian] Addresses comments from @marmbrus
      7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
      0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
      52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
      c466de6 [Cheng Lian] Addresses comments
      bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
      795920a [Cheng Lian] Fixes compilation error after rebasing
      0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
      fa543f3 [Cheng Lian] Addresses comments
      5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
      51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
      c4ed4fe [Cheng Lian] Bug fixes and a new test suite
      a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan
      5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
      54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
      be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
      0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
      f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
      422ff4a [Cheng Lian] Fixes style issue
      ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
      8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
      ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
      f18dec2 [Cheng Lian] More strict schema checking
      b746ab5 [Cheng Lian] More tests
      9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
      ea6c8dd [Cheng Lian] Removes remote debugging stuff
      327bb1d [Cheng Lian] Implements partitioning support for data sources API
      3c5073a [Cheng Lian] Fixes SaveModes used in test cases
      fb5a607 [Cheng Lian] Fixes compilation error
      9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
      5de194a [Cheng Lian] Forgot Apache licence header
      95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
      770b5ba [Cheng Lian] Adds tests for FSBasedRelation
      3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
      1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
      aa8ba9a [Cheng Lian] Javadoc fix
      012ed2d [Cheng Lian] Adds PartitioningOptions
      7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support
    • [SPARK-7485] [BUILD] Remove pyspark files from assembly. · 82e890fb
      Marcelo Vanzin authored
      The sbt part of the build is hacky; it basically tricks sbt
      into generating the zip by using a generator, but returns
      an empty list for the generated files so that nothing is
      actually added to the assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:
      
      22c1e04 [Marcelo Vanzin] Remove unneeded code.
      4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
  6. May 11, 2015
    • [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the... · f9c7580a
      Tathagata Das authored
      [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the current state of the context
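      
      A minimal usage sketch (assuming an existing `SparkConf` named `conf`):
      
      ```scala
      import org.apache.spark.streaming.{Seconds, StreamingContext, StreamingContextState}
      
      val ssc = new StreamingContext(conf, Seconds(1))
      assert(ssc.getState() == StreamingContextState.INITIALIZED)
      // ... define DStreams, then:
      ssc.start()
      assert(ssc.getState() == StreamingContextState.ACTIVE)
      ssc.stop()
      assert(ssc.getState() == StreamingContextState.STOPPED)
      ```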
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6058 from tdas/SPARK-7530 and squashes the following commits:
      
      80ee0e6 [Tathagata Das] STARTED --> ACTIVE
      3da6547 [Tathagata Das] Added synchronized
      dd88444 [Tathagata Das] Added more docs
      e1a8505 [Tathagata Das] Fixed comment length
      89f9980 [Tathagata Das] Change to Java enum and added Java test
      7c57351 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      dd4e702 [Tathagata Das] Addressed comments.
      3d56106 [Tathagata Das] Added Mima excludes
      2b86ba1 [Tathagata Das] Added scala docs.
      1722433 [Tathagata Das] Fixed style
      976b094 [Tathagata Das] Added license
      0585130 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      e0f0a05 [Tathagata Das] Added getState and exposed StreamingContextState
  7. May 08, 2015
    • [SPARK-6869] [PYSPARK] Add pyspark archives path to PYTHONPATH · ebff7327
      Lianhui Wang authored
      Based on https://github.com/apache/spark/pull/5478, which provides a PYSPARK_ARCHIVES_PATH env variable. With this PR, we just need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when we don't install PySpark on each node of YARN. I ran a Python application successfully on yarn-client and yarn-cluster with this PR.
      andrewor14 sryza Sephiroth-Lin Can you take a look at this? Thanks.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:
      
      66ffa43 [Lianhui Wang] Update Client.scala
      c2ad0f9 [Lianhui Wang] Update Client.scala
      1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      20402cd [Lianhui Wang] use ZipEntry
      9d87c3f [Lianhui Wang] update scala style
      e7bd971 [Lianhui Wang] address vanzin's comments
      4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
      e6b573b [Lianhui Wang] address vanzin's comments
      f11f84a [Lianhui Wang] zip pyspark archives
      5192cca [Lianhui Wang] update import path
      3b1e4c8 [Lianhui Wang] address tgravescs's comments
      9396346 [Lianhui Wang] put zip to make-distribution.sh
      0d2baf7 [Lianhui Wang] update import paths
      e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
      31e8e06 [Lianhui Wang] update code style
      9f31dac [Lianhui Wang] update code and add comments
      f72987c [Lianhui Wang] add archives path to PYTHONPATH
  8. May 07, 2015
    • [SPARK-6908] [SQL] Use isolated Hive client · cd1d4110
      Michael Armbrust authored
      This PR switches Spark SQL's Hive support to use the isolated hive client interface introduced by #5851, instead of directly interacting with the client.  By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version` without the need to recompile.  This also greatly reduces the surface area for our interaction with the hive libraries, hopefully making it easier to support other versions in the future.
      
      Jars for the desired Hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
       - a colon-separated list of jar files or directories for Hive and Hadoop.
       - `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This
                  option is only valid when using the execution version of Hive.
       - `maven` - download the correct version of Hive on demand from Maven.
      
      By default, `builtin` is used for Hive 13.
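      
      For example, to point Spark SQL at a different metastore version at runtime (a sketch; the jar paths are illustrative):
      
      ```scala
      import org.apache.spark.sql.hive.HiveContext
      
      val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`
      hiveContext.setConf("spark.sql.hive.metastore.version", "0.12.0")
      // Either supply a colon-separated classpath for Hive and Hadoop...
      hiveContext.setConf("spark.sql.hive.metastore.jars",
        "/opt/hive-0.12/lib/*:/opt/hadoop/lib/*")
      // ...or let Spark fetch the matching version on demand:
      // hiveContext.setConf("spark.sql.hive.metastore.jars", "maven")
      ```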
      
      This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores.  However, the full removal of the Shim is deferred until a later PR.
      
      Remaining TODOs:
       - Remove the Hive Shims and inline code for Hive 13.
       - Several HiveCompatibility tests are not yet passing.
        - `nullformatCTAS` - As detailed below, we now are handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer.  However, we currently only handle the common cases and not things like CTAS where the null format is specified.
        - `combine1` now leaks state about compression somehow, breaking all subsequent tests.  As such, we currently add it to the blacklist.
        - `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore.  We are correctly propagating the information
        - "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests.  It seems our `RESET` mechanism may not be as robust as it used to be?
      
      Other required changes:
       -  `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline.  Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`.  The full parsing here is not yet complete as detailed above in the remaining TODOs.  Since the operator is Hive specific, it is moved to the hive package.
       - `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5876 from marmbrus/useIsolatedClient and squashes the following commits:
      
      258d000 [Michael Armbrust] really really correct path handling
      e56fd4a [Michael Armbrust] getAbsolutePath
      5a259f5 [Michael Armbrust] fix typos
      81bb366 [Michael Armbrust] comments from vanzin
      5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      4b5cd41 [Michael Armbrust] yin's comments
      f5de7de [Michael Armbrust] cleanup
      11e9c72 [Michael Armbrust] better coverage in versions suite
      7e8f010 [Michael Armbrust] better error messages and jar handling
      e7b3941 [Michael Armbrust] more permisive checking for function registration
      da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      5fe5894 [Michael Armbrust] fix serialization suite
      81711c4 [Michael Armbrust] Initial support for running without maven
      1d8ae44 [Michael Armbrust] fix final tests?
      1c50813 [Michael Armbrust] more comments
      a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      a6f5df1 [Michael Armbrust] style
      ab07f7e [Michael Armbrust] WIP
      4d8bf02 [Michael Armbrust] Remove hive 12 compilation
      8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
  9. Apr 30, 2015
    • [Build] Enable MiMa checks for SQL · fa01bec4
      Josh Rosen authored
      Now that 1.3 has been released, we should enable MiMa checks for the `sql` subproject.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5727 from JoshRosen/enable-more-mima-checks and squashes the following commits:
      
      3ad302b [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
      0c48e4d [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
      e276cee [Josh Rosen] Fix SQL MiMa checks via excludes and private[sql]
      44d0d01 [Josh Rosen] Add back 'launcher' exclude
      1aae027 [Josh Rosen] Enable MiMa checks for launcher and sql projects.
    • [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add... · 07a86205
      Josh Rosen authored
      [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add facade in front of Unsafe; remove use of Unsafe.setMemory
      
      This patch suppresses compiler warnings due to our use of `sun.misc.Unsafe` (introduced in #5725).  These warnings can only be suppressed via the `-XDignore.symbol.file` javac flag; the `SuppressWarnings` annotation won't work for these.
      
      In order to restrict uses of this compiler flag to the `unsafe` module, I placed a facade in front of `Unsafe` so that other modules won't call it directly. This facade will also help us avoid accidental usage of deprecated Unsafe methods or methods that aren't supported in Java 6.
      
      I also removed an unnecessary use of `Unsafe.setMemory`, which isn't present in certain versions of Java 6, and excluded the new `unsafe` module from Javadoc.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5814 from JoshRosen/unsafe-compiler-warnings-fixes and squashes the following commits:
      
      9e8c483 [Josh Rosen] Exclude new unsafe module from Javadoc
      ba75ecf [Josh Rosen] Only apply -XDignore.symbol.file flag in unsafe project.
      7403345 [Josh Rosen] Put facade in front of Unsafe.
      50230c0 [Josh Rosen] Remove usage of Unsafe.setMemory
      96d41c9 [Josh Rosen] Use -XDignore.symbol.file to suppress warnings about sun.misc.Unsafe usage
    • [SPARK-7207] [ML] [BUILD] Added ml.recommendation, ml.regression to SparkBuild · adbdb19a
      Joseph K. Bradley authored
      Added ml.recommendation, ml.regression to SparkBuild
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5758 from jkbradley/SPARK-7207 and squashes the following commits:
      
      a28158a [Joseph K. Bradley] Added ml.recommendation, ml.regression to SparkBuild
  10. Apr 29, 2015
    • [SPARK-7076][SPARK-7077][SPARK-7080][SQL] Use managed memory for aggregations · f49284b5
      Josh Rosen authored
      This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory.  This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses.  As a result, this allows for higher memory utilization.  It can also result in better cache locality since objects will be stored closer together in memory.
      
      This feature can be enabled by setting `spark.sql.unsafe.enabled=true`.  For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric.
      
      ### Managing memory with sun.misc.Unsafe
      
      This patch supports both on- and off-heap managed memory.
      
      - In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
      - In off-heap mode, memory is addressed directly with 64-bit long addresses.
      
      To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields.  In off-heap mode, we simply pass `null` as `baseObject`.
      
      We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks.
      
      By default, we use on-heap mode.  To enable off-heap mode, set `spark.unsafe.offHeap=true`.
      
      To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`.  These classes work together to track allocations and detect memory leaks.
      
      ### Compact tuple format
      
      This patch introduces `UnsafeRow`, a compact row layout.  In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)
      
      - Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
      - The bit set is used for null tracking:
      	- Position _i_ is set if and only if field _i_ is null
      	- The bit set is aligned to an 8-byte word boundary.
      - Every field appears as an 8-byte word in the fixed-length values part:
      	- If a field is null, we zero out the values.
      	- If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
      - Each variable-length data type can have its own encoding:
      	- For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes.  If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.
      
      For example, a tuple that consists of 3 fields of type (int, string, string), with value (null, “data”, “bricks”) would look like this:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)
      
      This format allows us to compare tuples for equality by directly comparing their raw bytes.  This also enables fast hashing of tuples.
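      
      Back-of-the-envelope layout arithmetic (a hypothetical helper, not the UnsafeRow API):
      
      ```scala
      // Bytes occupied by the null bit set plus the fixed-length region for a row
      // of n fields, per the layout above: the bit set is padded to 8-byte words,
      // and each field occupies one 8-byte word in the fixed-length part.
      def fixedRegionBytes(numFields: Int): Int = {
        val bitSetWords = (numFields + 63) / 64  // one bit per field, word-aligned
        (bitSetWords + numFields) * 8
      }
      
      // The (int, string, string) example: 3 fields => (1 + 3) * 8 = 32 bytes
      // before the variable-length "data"/"bricks" payloads.
      ```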
      
      ### Hash map for performing aggregations
      
      This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-width.  This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.
      
      `BytesToBytesMap` stores pointers to key and value tuples.  For each record with a new key, we copy the key and create the aggregation value buffer for that key and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.
      
      This map is implemented using open hashing with triangular sequence probing.  Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode.  By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions (since the chance of hashcode collision is much lower than position collision).
      
      `UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map.  Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`.  This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.
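      
      An illustrative sketch of the entry encoding and probe sequence described above (not the actual BytesToBytesMap code):
      
      ```scala
      // Each entry occupies two adjacent slots of a long array:
      //   longArray(2 * pos)     = address of the key tuple
      //   longArray(2 * pos + 1) = (key-to-value offset << 32) | 32-bit hashcode
      def storeEntry(longArray: Array[Long], pos: Int,
                     keyAddress: Long, keyToValueOffset: Long, hashcode: Int): Unit = {
        longArray(2 * pos) = keyAddress
        longArray(2 * pos + 1) = (keyToValueOffset << 32) | (hashcode & 0xFFFFFFFFL)
      }
      
      // Triangular sequence probing: step sizes grow 1, 2, 3, ..., so probe i
      // lands at start + i * (i + 1) / 2 (mod capacity).
      def probe(start: Int, i: Int, capacity: Int): Int =
        (start + i * (i + 1) / 2) % capacity
      ```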
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5725 from JoshRosen/unsafe and squashes the following commits:
      
      eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns.
      81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate
      1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing.
      017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize()
      50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication
      70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager:
      6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods.
      de5e001 [Josh Rosen] Fix debug vs. trace in logging message.
      a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention.
      78a5b84 [Josh Rosen] Add logging to MemoryManager
      ce3c565 [Josh Rosen] More comments, formatting, and code cleanup.
      529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds.
      3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace
      162caf7 [Josh Rosen] Fix test compilation
      b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size.
      a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv.
      0925847 [Josh Rosen] Disable MiMa checks for new unsafe module
      cde4132 [Josh Rosen] Add missing pom.xml
      9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap
      6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow.
      31eaabc [Josh Rosen] Lots of TODO and doc cleanup.
      a95291e [Josh Rosen] Cleanups to string handling code
      afe8dca [Josh Rosen] Some Javadoc cleanup
      f3dcbfe [Josh Rosen] More mod replacement
      854201a [Josh Rosen] Import and comment cleanup
      06e929d [Josh Rosen] More warning cleanup
      ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections
      29a7575 [Josh Rosen] Remove debug logging
      49aed30 [Josh Rosen] More long -> int conversion.
      b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation.
      765243d [Josh Rosen] Enable optional performance metrics for hash map.
      23a440a [Josh Rosen] Bump up default hash map size
      628f936 [Josh Rosen] Use ints intead of longs for indexing.
      92d5a06 [Josh Rosen] Address a number of minor code review comments.
      1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests.
      d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap
      bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...)
      b3eaccd [Josh Rosen] Extract aggregation map into its own class.
      d2bb986 [Josh Rosen] Update to implement new Row methods added upstream
      58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable)
      7df6008 [Josh Rosen] Optimizations related to zeroing out memory:
      c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free():
      738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate
      c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed.
      62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns.
      c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter
      ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort"
      c754ae1 [Josh Rosen] Now that the store*() contract has been stregthened, we can remove an extra lookup
      f764d13 [Josh Rosen] Simplify address + length calculation in Location.
      079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract.
      1a483c5 [Josh Rosen] First version that passes some aggregation tests:
      fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate
      53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters
      1ff814d [Josh Rosen] Add reminder to free memory on iterator completion
      8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration.
      5d55cef [Josh Rosen] Add skeleton for Row implementation.
      f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods.
      ab68e08 [Josh Rosen] Begin merging the UTF8String implementations.
      480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
  11. Apr 28, 2015
    • [SPARK-6756] [MLLIB] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector · 5ef006fc
      Xiangrui Meng authored
      Add `compressed` to `Vector` with some other methods: `numActives`, `numNonzeros`, `toSparse`, and `toDense`. jkbradley
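      
      A quick sketch of the new methods:
      
      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      
      val v = Vectors.dense(1.0, 0.0, 3.0)
      v.numActives   // 3: every entry of a dense vector is explicitly stored
      v.numNonzeros  // 2
      v.toSparse     // (3,[0,2],[1.0,3.0])
      v.compressed   // picks whichever of the sparse/dense forms is smaller
      ```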
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5756 from mengxr/SPARK-6756 and squashes the following commits:
      
      8d4ecbd [Xiangrui Meng] address comment and add mima excludes
      da54179 [Xiangrui Meng] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector
  12. Apr 27, 2015
    • [SPARK-7090] [MLLIB] Introduce LDAOptimizer to LDA to further improve extensibility · 4d9e560b
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7090
      
      LDA was implemented with extensibility in mind. And with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms.
      As Joseph Bradley jkbradley proposed in https://github.com/apache/spark/pull/4807 and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
      Basically class LDA would be a common entrance for LDA computing. And each LDA object will refer to a LDAOptimizer for the concrete algorithm implementation. Users can customize LDAOptimizer with specific parameters and assign it to LDA.
      
      Concrete changes:
      
      1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.
      
      2. Move EMOptimizer to the LDAOptimizer file, have it inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes in the future).
              - Adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
              - Move the code from LDA.initialState to the initialState of EMLDAOptimizer.
      
      3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.
      
      4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
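      
      A sketch of the resulting user-facing API (assuming the setter described in point 3; EMLDAOptimizer remains the default):
      
      ```scala
      import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA}
      
      val lda = new LDA()
        .setK(10)
        .setOptimizer(new EMLDAOptimizer)
      // lda.run(corpus) now returns the base LDAModel type rather than
      // DistributedLDAModel (point 4 above).
      ```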
      
      Further work:
      add OnlineLDAOptimizer and other possible Optimizers once ready.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:
      
      0e2e006 [Yuhao Yang] respond to review comments
      08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      e756ce4 [Yuhao Yang] solve mima exception
      d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      0bb8400 [Yuhao Yang] refactor LDA with Optimizer
      ec2f857 [Yuhao Yang] protoptype for discussion
  13. Apr 17, 2015
    • [SPARK-6703][Core] Provide a way to discover existing SparkContext's · c5ed5101
      Ilya Ganelin authored
      I've added a getOrCreate method to the SparkContext companion object that allows one either to retrieve a previously created SparkContext or to instantiate a new one with the provided config. The method accepts an optional SparkConf to make usage intuitive.
      
      Still working on a test for this; basically I want to create a new context from scratch, then ensure that subsequent calls don't overwrite it.
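      
      A minimal sketch of the intended usage (the app name is illustrative):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      val conf = new SparkConf().setAppName("getOrCreate-demo")
      val sc = SparkContext.getOrCreate(conf)
      // A subsequent call returns the same instance instead of overwriting it:
      assert(SparkContext.getOrCreate() eq sc)
      ```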
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5501 from ilganeli/SPARK-6703 and squashes the following commits:
      
      db9a963 [Ilya Ganelin] Closing second spark context
      1dc0444 [Ilya Ganelin] Added ref equality check
      8c884fa [Ilya Ganelin] Made getOrCreate synchronized
      cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
      270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
      15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
      0e1567c [Ilya Ganelin] Got rid of unecessary option for AtomicReference
      dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
      733ec9f [Ilya Ganelin] Fixed some bugs in test code
      8be2f83 [Ilya Ganelin] Replaced match with if
      e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
      a99032f [Ilya Ganelin] Spacing fix
      d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
  14. Apr 14, 2015
    • [SPARK-5808] [build] Package pyspark files in sbt assembly. · 65774370
      Marcelo Vanzin authored
      This turned out to be more complicated than I wanted because the
      layout of python/ doesn't really follow the usual maven conventions.
      So some extra code is needed to copy just the right things.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5461 from vanzin/SPARK-5808 and squashes the following commits:
      
      7153dac [Marcelo Vanzin] Only try to create resource dir if it doesn't already exist.
      ee90e84 [Marcelo Vanzin] [SPARK-5808] [build] Package pyspark files in sbt assembly.
  15. Apr 11, 2015
    • [hotfix] [build] Make sure JAVA_HOME is set for tests. · 694aef0d
      Marcelo Vanzin authored
      This is needed at least for YARN integration tests, since `$JAVA_HOME` is used to launch the executors.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5441 from vanzin/yarn-test-test and squashes the following commits:
      
      3eeec30 [Marcelo Vanzin] Use JAVA_HOME when available, java.home otherwise.
      d71f1bb [Marcelo Vanzin] And sbt too.
      6bda399 [Marcelo Vanzin] WIP: Testing to see whether this fixes the yarn test issue.
  16. Apr 09, 2015
    • [Spark-6693][MLlib]add tostring with max lines and width for matrix · 9c67049b
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-6693
      
      It's kind of annoying when debugging to find that you cannot print out the matrix as you want.
      
      The original toString of Matrix only prints like the following:
      0.17810102596909183    0.5616906241468385    ... (10 total)
      0.9692861997823815     0.015558159784155756  ...
      0.8513015122819192     0.031523763918528847  ...
      0.5396875653953941     0.3267864552779176    ...
      
      The `def toString(maxLines: Int, maxWidth: Int)` is useful when debugging, logging, and saving matrices to files.
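      
      A usage sketch (the matrix values are illustrative):
      
      ```scala
      import org.apache.spark.mllib.linalg.Matrices
      
      val m = Matrices.dense(2, 2, Array(0.178, 0.969, 0.561, 0.015))
      // Cap the printout instead of dumping the whole matrix into the log:
      println(m.toString(maxLines = 4, maxWidth = 40))
      ```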
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5344 from hhbyyh/addToString and squashes the following commits:
      
      19a6836 [Yuhao Yang] remove extra line
      6314b21 [Yuhao Yang] add exclude
      736c324 [Yuhao Yang] add ut and exclude
      420da39 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into addToString
      c22f352 [Yuhao Yang] style change
      64a9e0f [Yuhao Yang] add specific to string to matrix
  17. Apr 07, 2015
    • [SPARK-6750] Upgrade ScalaStyle to 0.7. · 12322159
      Reynold Xin authored
      0.7 fixes a pretty useful bug: inline functions no longer require an explicit return type definition.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5399 from rxin/style0.7 and squashes the following commits:
      
      54c41b2 [Reynold Xin] Actually update the version.
      09c759c [Reynold Xin] [SPARK-6750] Upgrade ScalaStyle to 0.7.
  18. Apr 03, 2015
    • [SPARK-6492][CORE] SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies · 2c43ea38
      Ilya Ganelin authored
      I've added a timeout and retry loop around the SparkContext shutdown code that should fix this deadlock. If a SparkContext shutdown is in progress when another thread comes knocking, it will wait for 10 seconds for the lock, then fall through where the outer loop will re-submit the request.
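      
      A sketch of the timeout/retry idea (illustrative, not the actual SparkContext code):
      
      ```scala
      import java.util.concurrent.TimeUnit
      import java.util.concurrent.locks.ReentrantLock
      
      val shutdownLock = new ReentrantLock()
      
      def stopWithRetry(doStop: () => Unit): Unit = {
        var done = false
        while (!done) {
          // Wait up to 10 seconds for an in-progress shutdown to release the lock...
          if (shutdownLock.tryLock(10, TimeUnit.SECONDS)) {
            try { doStop(); done = true } finally { shutdownLock.unlock() }
          }
          // ...otherwise fall through and the outer loop re-submits the request.
        }
      }
      ```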
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5277 from ilganeli/SPARK-6492 and squashes the following commits:
      
      8617a7e [Ilya Ganelin] Resolved merge conflict
      2fbab66 [Ilya Ganelin] Added MIMA Exclude
      a0e2c70 [Ilya Ganelin] Deleted stale imports
      fa28ce7 [Ilya Ganelin] reverted to just having a single stopped
      76fc825 [Ilya Ganelin] Updated to use atomic booleans instead of the synchronized vars
      6e8a7f7 [Ilya Ganelin] Removing unecessary null check for now since i'm not fixing stop ordering yet
      cdf7073 [Ilya Ganelin] [SPARK-6492] Moved stopped=true back to the start of the shutdown sequence so this can be addressed in a seperate PR
      7fb795b [Ilya Ganelin] Spacing
      b7a0c5c [Ilya Ganelin] Import ordering
      df8224f [Ilya Ganelin] Added comment for added lock
      343cb94 [Ilya Ganelin] [SPARK-6492] Added timeout/retry logic to fix a deadlock in SparkContext shutdown
  19. Apr 01, 2015
    • [SPARK-4655][Core] Split Stage into ShuffleMapStage and ResultStage subclasses · ff1915e1
      Ilya Ganelin authored
      Hi all - this patch changes the Stage class to an abstract class and introduces two new classes that extend it: ShuffleMapStage and ResultStage - with the goal of increasing readability of the DAGScheduler class. Their usage is updated within DAGScheduler.
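      
      The shape of the refactoring, roughly (illustrative; the real classes are internal to the scheduler and carry more state):
      
      ```scala
      abstract class Stage(val id: Int)
      
      // Intermediate stage: writes map output for a shuffle.
      class ShuffleMapStage(id: Int /* , shuffleDep, ... */) extends Stage(id)
      
      // Final stage: computes the result of an action.
      class ResultStage(id: Int /* , resultHandler, ... */) extends Stage(id)
      ```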
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      Author: Ilya Ganelin <ilganeli@gmail.com>
      
      Closes #4708 from ilganeli/SPARK-4655 and squashes the following commits:
      
      c248924 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      d930385 [Ilya Ganelin] Fixed merge conflict from
      a9a765f [Ilya Ganelin] Update DAGScheduler.scala
      c03563c [Ilya Ganelin] Minor fixeS
      c39e971 [Ilya Ganelin] Added return typing for public methods
      845bc87 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      e8031d8 [Ilya Ganelin] Minor string fixes
      4ec53ac [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      c004f62 [Ilya Ganelin] Update DAGScheduler.scala
      a2cb03f [Ilya Ganelin] [SPARK-4655] Replaced usages of Nil and eliminated some code reuse
      3d5cf20 [Ilya Ganelin] [SPARK-4655] Moved mima exclude to 1.4
      6912c55 [Ilya Ganelin] Resolved merge conflict
      4bff208 [Ilya Ganelin] Minor stylistic fixes
      c6fffbb [Ilya Ganelin] newline
      41402ad [Ilya Ganelin] Style fixes
      02c6981 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      c755a09 [Ilya Ganelin] Some more stylistic updates and minor refactoring
      b6257a0 [Ilya Ganelin] Update MimaExcludes.scala
      0f0c624 [Ilya Ganelin] Fixed merge conflict
      2eba262 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      6b43d7b [Ilya Ganelin] Got rid of some spaces
      6f1a5db [Ilya Ganelin] Revert "More minor formatting and refactoring"
      1b3471b [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      c9288e2 [Ilya Ganelin] More minor formatting and refactoring
      d548caf [Ilya Ganelin] Formatting fix
      c3ae5c2 [Ilya Ganelin] Explicit typing
      0dacaf3 [Ilya Ganelin] Got rid of stale import
      6da3a71 [Ilya Ganelin] Trailing whitespace
      b85c5fe [Ilya Ganelin] Added minor fixes
      a57dfcd [Ilya Ganelin] Added MiMA exclusion to get around binary compatibility check
      83ed849 [Ilya Ganelin] moved braces for consistency
      96dd161 [Ilya Ganelin] Fixed minor style error
      cfd6f10 [Ilya Ganelin] Updated DAGScheduler to use new ResultStage and ShuffleMapStage classes
      83494e9 [Ilya Ganelin] Added new Stage classes
      ff1915e1
  20. Mar 30, 2015
    • CodingCat's avatar
      [SPARK-6592][SQL] fix filter for scaladoc to generate API doc for Row class under catalyst dir · 32259c67
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-6592
      
      The current impl in SparkBuild.scala filters out all classes under the catalyst directory; however, we have a corner case: the Row class is a public API under that directory.
      
      We need to include Row in the scaladoc while still excluding the other classes of the catalyst project.
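
      A hedged sbt-style sketch of the kind of filter involved (illustrative only, not the exact SparkBuild.scala code):

      ```
      // Keep Row.scala in the doc sources while dropping everything else
      // under the catalyst directory.
      sources in (Compile, doc) := (sources in (Compile, doc)).value.filterNot { f =>
        f.getPath.contains("catalyst") && f.getName != "Row.scala"
      }
      ```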
      
      Thanks for the help on this patch from rxin and liancheng
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #5252 from CodingCat/SPARK-6592 and squashes the following commits:
      
      02098a4 [CodingCat] ignore collection, enable types (except those protected classes)
      f7af2cb [CodingCat] commit
      3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory
      32259c67
  21. Mar 29, 2015
    • zsxwing's avatar
      [SPARK-5124][Core] A standard RPC interface and an Akka implementation · a8d53afb
      zsxwing authored
      This PR added a standard internal RPC interface for Spark and an Akka implementation. See [the design document](https://issues.apache.org/jira/secure/attachment/12698710/Pluggable%20RPC%20-%20draft%202.pdf) for more details.
      
      I will split the whole work into multiple PRs to make it easier for code review. This is the first PR and avoids touching too many files.
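
      An illustrative reduction of the interface described in the design doc (the real traits carry more methods; names follow the commits listed below):

      ```
      // Ask-style messages reply through a context object.
      trait RpcCallContext {
        def reply(response: Any): Unit
        def sendFailure(e: Throwable): Unit
      }

      // Fire-and-forget vs. ask-style message handling, in miniature.
      trait RpcEndpoint {
        def receive: PartialFunction[Any, Unit]
        def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
      }
      ```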
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4588 from zsxwing/rpc-part1 and squashes the following commits:
      
      fe3df4c [zsxwing] Move registerEndpoint and use actorSystem.dispatcher in asyncSetupEndpointRefByURI
      f6f3287 [zsxwing] Remove RpcEndpointRef.toURI
      8bd1097 [zsxwing] Fix docs and the code style
      f459380 [zsxwing] Add RpcAddress.fromURI and rename urls to uris
      b221398 [zsxwing] Move send methods above ask methods
      15cfd7b [zsxwing] Merge branch 'master' into rpc-part1
      9ffa997 [zsxwing] Fix MiMa tests
      78a1733 [zsxwing] Merge remote-tracking branch 'origin/master' into rpc-part1
      385b9c3 [zsxwing] Fix the code style and add docs
      2cc3f78 [zsxwing] Add an asynchronous version of setupEndpointRefByUrl
      e8dfec3 [zsxwing] Remove 'sendWithReply(message: Any, sender: RpcEndpointRef): Unit'
      08564ae [zsxwing] Add RpcEnvFactory to create RpcEnv
      e5df4ca [zsxwing] Handle AkkaFailure(e) in Actor
      ec7c5b0 [zsxwing] Fix docs
      7fc95e1 [zsxwing] Implement askWithReply in RpcEndpointRef
      9288406 [zsxwing] Document thread-safety for setupThreadSafeEndpoint
      3007c09 [zsxwing] Move setupDriverEndpointRef to RpcUtils and rename to makeDriverRef
      c425022 [zsxwing] Fix the code style
      5f87700 [zsxwing] Move the logical of processing message to a private function
      3e56123 [zsxwing] Use lazy to eliminate CountDownLatch
      07f128f [zsxwing] Remove ActionScheduler.scala
      4d34191 [zsxwing] Remove scheduler from RpcEnv
      7cdd95e [zsxwing] Add docs for RpcEnv
      51e6667 [zsxwing] Add 'sender' to RpcCallContext and rename the parameter of receiveAndReply to 'context'
      ffc1280 [zsxwing] Rename 'fail' to 'sendFailure' and other minor code style changes
      28e6d0f [zsxwing] Add onXXX for network events and remove the companion objects of network events
      3751c97 [zsxwing] Rename RpcResponse to RpcCallContext
      fe7d1ff [zsxwing] Add explicit reply in rpc
      7b9e0c9 [zsxwing] Fix the indentation
      04a106e [zsxwing] Remove NopCancellable and add a const NOP in object SettableCancellable
      2a579f4 [zsxwing] Remove RpcEnv.systemName
      155b987 [zsxwing] Change newURI to uriOf and add some comments
      45b2317 [zsxwing] A standard RPC interface and An Akka implementation
      a8d53afb
  22. Mar 26, 2015
    • Brennon York's avatar
      [SPARK-6510][GraphX]: Add Graph#minus method to act as Set#difference · 39fb5796
      Brennon York authored
      Adds a `Graph#minus` method which returns only the unique `VertexId`s from the calling `VertexRDD`.
      
      A basic example in pseudocode:
      
      ```
      Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
      > Set((0L,0))
      ```
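
      In GraphX terms that might look like the following (a sketch, assuming an existing `SparkContext` named `sc`):

      ```
      import org.apache.spark.graphx._

      val a: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((0L, 0), (1L, 1))))
      val b: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((1L, 1), (2L, 2))))
      a.minus(b).collect()  // Array((0L, 0))
      ```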
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5175 from brennonyork/SPARK-6510 and squashes the following commits:
      
      248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
      3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
      6575d92 [Brennon York] updated mima exclude
      aaa030b [Brennon York] completed graph#minus functionality
      7227c0f [Brennon York] beginning work on minus functionality
      39fb5796
  23. Mar 24, 2015
    • Reynold Xin's avatar
      [SPARK-6428] Added explicit types for all public methods in core. · 4ce2782a
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5125 from rxin/core-explicit-type and squashes the following commits:
      
      f471415 [Reynold Xin] Revert style checker changes.
      81b66e4 [Reynold Xin] Code review feedback.
      a7533e3 [Reynold Xin] Mima excludes.
      1d795f5 [Reynold Xin] [SPARK-6428] Added explicit types for all public methods in core.
      4ce2782a
  24. Mar 20, 2015
    • Marcelo Vanzin's avatar
      [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT. · a7456459
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:
      
      63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
      6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
      178ba71 [Marcelo Vanzin] Oops.
      75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
      a45a62c [Marcelo Vanzin] Work around MIMA warning.
      1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
      0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
      cef4603 [Marcelo Vanzin] Indentation.
      296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
      a7456459
  25. Mar 16, 2015
    • Brennon York's avatar
      [SPARK-5922][GraphX]: Add diff(other: RDD[VertexId, VD]) in VertexRDD · 45f4c661
      Brennon York authored
      Changed the parameter type of 'diff' from VertexRDD[VD] to RDD[(VertexId, VD)] to match that of 'innerJoin' and 'leftJoin'. This change maintains backwards compatibility and better unifies the VertexRDD methods with each other.
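
      A sketch of the resulting pair of overloads (a hypothetical stand-in class; the real methods live on org.apache.spark.graphx.VertexRDD):

      ```
      import org.apache.spark.graphx.{VertexId, VertexRDD}
      import org.apache.spark.rdd.RDD

      abstract class VertexRDDLike[VD] {
        def diff(other: VertexRDD[VD]): VertexRDD[VD]        // original, kept for binary compatibility
        def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD]  // new form, matching innerJoin and leftJoin
      }
      ```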
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #4733 from brennonyork/SPARK-5922 and squashes the following commits:
      
      e800f08 [Brennon York] fixed merge conflicts
      b9274af [Brennon York] fixed merge conflicts
      f86375c [Brennon York] fixed minor include line
      398ddb4 [Brennon York] fixed merge conflicts
      aac1810 [Brennon York] updated to aggregateUsingIndex and added test to ensure that method works properly
      2af0b88 [Brennon York] removed deprecation line
      753c963 [Brennon York] fixed merge conflicts and set preference to use the diff(other: VertexRDD[VD]) method
      2c678c6 [Brennon York] added mima exclude to exclude new public diff method from VertexRDD
      93186f3 [Brennon York] added back the original diff method to sustain binary compatibility
      f18356e [Brennon York] changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]
      45f4c661
  26. Mar 15, 2015
    • OopsOutOfMemory's avatar
      [SPARK-6285][SQL]Remove ParquetTestData in SparkBuild.scala and in README.md · 62ede538
      OopsOutOfMemory authored
      This is a follow-up clean-up PR for #5010.
      It resolves issues like the one below when launching `hive/console`:
      ```
      <console>:20: error: object ParquetTestData is not a member of package org.apache.spark.sql.parquet
             import org.apache.spark.sql.parquet.ParquetTestData
      ```
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #5032 from OopsOutOfMemory/SPARK-6285 and squashes the following commits:
      
      2996aeb [OopsOutOfMemory] remove ParquetTestData
      62ede538
  27. Mar 13, 2015
    • vinodkc's avatar
      [SPARK-6317][SQL]Fixed HIVE console startup issue · e360d5e4
      vinodkc authored
      Author: vinodkc <vinod.kc.in@gmail.com>
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #5011 from vinodkc/HIVE_console_startupError and squashes the following commits:
      
      b43925f [vinodkc] Changed order of import
      b4f5453 [Vinod K C] Fixed HIVE console startup issue
      e360d5e4
  28. Mar 12, 2015
    • Xiangrui Meng's avatar
      [SPARK-4588] ML Attributes · a4b27162
      Xiangrui Meng authored
      This continues the work in #4460 from srowen. The design doc is published on the JIRA page with some minor changes.
      
      Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24
      
      More details can be found in the design doc.
      
      srowen, could you help review this PR? There are many lines, but most of them are boilerplate code.
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4925 from mengxr/SPARK-4588-new and squashes the following commits:
      
      71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
      617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
      393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
      b1aceef [Xiangrui Meng] more tests
      e7ab467 [Xiangrui Meng] update ML attribute impl
      7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
      2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
      a4b27162
    • Xiangrui Meng's avatar
      [SPARK-5814][MLLIB][GRAPHX] Remove JBLAS from runtime · 0cba802a
      Xiangrui Meng authored
      The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage with netlib-java gives us a simpler dependency tree and fewer license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4699 from mengxr/SPARK-5814 and squashes the following commits:
      
      48635c6 [Xiangrui Meng] move netlib-java version to parent pom
      ca21c74 [Xiangrui Meng] remove jblas from ml-guide
      5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
      c5c4183 [Xiangrui Meng] merge master
      0f20cad [Xiangrui Meng] add mima excludes
      e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
      ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
      fa7c2ca [Xiangrui Meng] move jblas to test scope
      0cba802a
  29. Mar 11, 2015
    • Marcelo Vanzin's avatar
      [SPARK-4924] Add a library for launching Spark jobs programmatically. · 517975d8
      Marcelo Vanzin authored
      This change encapsulates all the logic involved in launching a Spark job
      into a small Java library that can be easily embedded into other applications.
      
      The overall goal of this change is twofold, as described in the bug:
      
      - Provide a public API for launching Spark processes. This is a common request
        from users and currently there's no good answer for it.
      
      - Remove a lot of the duplicated code and other coupling that exists in the
        different parts of Spark that deal with launching processes.
      
      A lot of the duplication was due to different code needed to build an
      application's classpath (and the bootstrapper needed to run the driver in
      certain situations), and also different code needed to parse spark-submit
      command line options in different contexts. The change centralizes those
      as much as possible so that all code paths can rely on the library for
      handling those appropriately.
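
      A small usage sketch of the new library (paths and class names are placeholders):

      ```
      import org.apache.spark.launcher.SparkLauncher

      val spark = new SparkLauncher()
        .setAppResource("/path/to/my-app.jar")   // placeholder jar
        .setMainClass("com.example.MyApp")       // placeholder main class
        .setMaster("local[2]")
        .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
        .launch()                                // returns a java.lang.Process
      spark.waitFor()
      ```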
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3916 from vanzin/SPARK-4924 and squashes the following commits:
      
      18c7e4d [Marcelo Vanzin] Fix make-distribution.sh.
      2ce741f [Marcelo Vanzin] Add lots of quotes.
      3b28a75 [Marcelo Vanzin] Update new pom.
      a1b8af1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      897141f [Marcelo Vanzin] Review feedback.
      e2367d2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      28cd35e [Marcelo Vanzin] Remove stale comment.
      b1d86b0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      00505f9 [Marcelo Vanzin] Add blurb about new API in the programming guide.
      5f4ddcc [Marcelo Vanzin] Better usage messages.
      92a9cfb [Marcelo Vanzin] Fix Win32 launcher, usage.
      6184c07 [Marcelo Vanzin] Rename field.
      4c19196 [Marcelo Vanzin] Update comment.
      7e66c18 [Marcelo Vanzin] Fix pyspark tests.
      0031a8e [Marcelo Vanzin] Review feedback.
      c12d84b [Marcelo Vanzin] Review feedback. And fix spark-submit on Windows.
      e2d4d71 [Marcelo Vanzin] Simplify some code used to launch pyspark.
      43008a7 [Marcelo Vanzin] Don't make builder extend SparkLauncher.
      b4d6912 [Marcelo Vanzin] Use spark-submit script in SparkLauncher.
      28b1434 [Marcelo Vanzin] Add a comment.
      304333a [Marcelo Vanzin] Fix propagation of properties file arg.
      bb67b93 [Marcelo Vanzin] Remove unrelated Yarn change (that is also wrong).
      8ec0243 [Marcelo Vanzin] Add missing newline.
      95ddfa8 [Marcelo Vanzin] Fix handling of --help for spark-class command builder.
      72da7ec [Marcelo Vanzin] Rename SparkClassLauncher.
      62978e4 [Marcelo Vanzin] Minor cleanup of Windows code path.
      9cd5b44 [Marcelo Vanzin] Make all non-public APIs package-private.
      e4c80b6 [Marcelo Vanzin] Reorganize the code so that only SparkLauncher is public.
      e50dc5e [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      de81da2 [Marcelo Vanzin] Fix CommandUtils.
      86a87bf [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      2061967 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      46d46da [Marcelo Vanzin] Clean up a test and make it more future-proof.
      b93692a [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      ad03c48 [Marcelo Vanzin] Revert "Fix a thread-safety issue in "local" mode."
      0b509d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      23aa2a9 [Marcelo Vanzin] Read java-opts from conf dir, not spark home.
      7cff919 [Marcelo Vanzin] Javadoc updates.
      eae4d8e [Marcelo Vanzin] Fix new unit tests on Windows.
      e570fb5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      44cd5f7 [Marcelo Vanzin] Add package-info.java, clean up javadocs.
      f7cacff [Marcelo Vanzin] Remove "launch Spark in new thread" feature.
      7ed8859 [Marcelo Vanzin] Some more feedback.
      54cd4fd [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      61919df [Marcelo Vanzin] Clean leftover debug statement.
      aae5897 [Marcelo Vanzin] Use launcher classes instead of jars in non-release mode.
      e584fc3 [Marcelo Vanzin] Rework command building a little bit.
      525ef5b [Marcelo Vanzin] Rework Unix spark-class to handle argument with newlines.
      8ac4e92 [Marcelo Vanzin] Minor test cleanup.
      e946a99 [Marcelo Vanzin] Merge PySparkLauncher into SparkSubmitCliLauncher.
      c617539 [Marcelo Vanzin] Review feedback round 1.
      fc6a3e2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      f26556b [Marcelo Vanzin] Fix a thread-safety issue in "local" mode.
      2f4e8b4 [Marcelo Vanzin] Changes needed to make this work with SPARK-4048.
      799fc20 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      bb5d324 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      53faef1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      a7936ef [Marcelo Vanzin] Fix pyspark tests.
      656374e [Marcelo Vanzin] Mima fixes.
      4d511e7 [Marcelo Vanzin] Fix tools search code.
      7a01e4a [Marcelo Vanzin] Fix pyspark on Yarn.
      1b3f6e9 [Marcelo Vanzin] Call SparkSubmit from spark-class launcher for unknown classes.
      25c5ae6 [Marcelo Vanzin] Centralize SparkSubmit command line parsing.
      27be98a [Marcelo Vanzin] Modify Spark to use launcher lib.
      6f70eea [Marcelo Vanzin] [SPARK-4924] Add a library for launching Spark jobs programmatically.
      517975d8
  30. Mar 03, 2015
    • Reynold Xin's avatar
      [SPARK-5310][SQL] Fixes to Docs and Datasources API · 54d19689
      Reynold Xin authored
       - Various fixes to docs
       - Make data source traits actually interfaces (see the sketch below)
      
      Based on #4862 but with fixed conflicts.
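
      Why the second item matters, in miniature: a Scala trait with only abstract members compiles to a plain Java interface, so Java data source implementations can implement it directly. A hypothetical, simplified example (not the actual sources API):

      ```
      // Compiles to a plain Java interface: no companion impl class, so Java
      // code can `implements` it and binary compatibility is easier to keep.
      trait MyRelationProvider {
        def createRelation(path: String): AnyRef
      }
      ```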
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4868 from marmbrus/pr/4862 and squashes the following commits:
      
      fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862
      0208497 [Reynold Xin] Test fixes.
      34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs.
      54d19689
  31. Feb 19, 2015
    • Sean Owen's avatar
      SPARK-4682 [CORE] Consolidate various 'Clock' classes · 34b7c353
      Sean Owen authored
      Another one from JoshRosen's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, as it is needed to consolidate the streaming one. I put together implementations in the way that seemed simplest. Almost all of the change is standardizing class and method names.
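
      The consolidated shape, roughly (a sketch; the real Clock ends up private[spark] in org.apache.spark.util, per the commits below):

      ```
      trait Clock {
        def getTimeMillis(): Long
      }

      class SystemClock extends Clock {
        override def getTimeMillis(): Long = System.currentTimeMillis()
      }

      // Manually advanced clock for deterministic tests (formerly FakeClock).
      class ManualClock(private var time: Long = 0L) extends Clock {
        override def getTimeMillis(): Long = synchronized { time }
        def advance(timeToAdd: Long): Unit = synchronized { time += timeToAdd }
      }
      ```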
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4514 from srowen/SPARK-4682 and squashes the following commits:
      
      5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
      169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
      277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
      b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
      160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
      7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
      34b7c353
  32. Feb 17, 2015
    • Michael Armbrust's avatar
      [SPARK-5166][SPARK-5247][SPARK-5258][SQL] API Cleanup / Documentation · c74b07fa
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4642 from marmbrus/docs and squashes the following commits:
      
      d291c34 [Michael Armbrust] python tests
      9be66e3 [Michael Armbrust] comments
      d56afc2 [Michael Armbrust] fix style
      f004747 [Michael Armbrust] fix build
      c4a907b [Michael Armbrust] fix tests
      42e2b73 [Michael Armbrust] [SQL] Documentation / API Clean-up.
      c74b07fa
  33. Feb 09, 2015
    • Marcelo Vanzin's avatar
      [SPARK-2996] Implement userClassPathFirst for driver, yarn. · 20a60131
      Marcelo Vanzin authored
      Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
      `spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
      modifies the system classpath, instead of restricting the changes to the user's class
      loader. So this change implements the behavior of the latter for Yarn, and deprecates
      the more dangerous choice.
      
      To be able to achieve feature-parity, I also implemented the option for drivers (the existing
      option only applies to executors). So now there are two options, each controlling whether
      to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
      aliased to the new one (`spark.executor.userClassPathFirst`).
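
      Enabling the two options in code, for illustration:

      ```
      import org.apache.spark.SparkConf

      // The executor option replaces the deprecated spark.files.userClassPathFirst;
      // the driver option is new in this change.
      val conf = new SparkConf()
        .set("spark.driver.userClassPathFirst", "true")
        .set("spark.executor.userClassPathFirst", "true")
      ```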
      
      The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
      was also doing some things that ended up causing JVM errors depending on how things
      were being called.
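
      A minimal sketch of the "child-first" idea, including the resource handling this paragraph calls out (illustrative only, not Spark's actual class loader):

      ```
      import java.net.{URL, URLClassLoader}

      // Try the user's jars first, then delegate to the given parent; note that
      // the resource lookup is overridden too, which the original loader missed.
      class ChildFirstClassLoader(urls: Array[URL], realParent: ClassLoader)
        extends URLClassLoader(urls, null: ClassLoader) {

        override def loadClass(name: String, resolve: Boolean): Class[_] = {
          try {
            super.loadClass(name, resolve)      // child (user) classpath first
          } catch {
            case _: ClassNotFoundException =>
              realParent.loadClass(name)        // fall back to the real parent
          }
        }

        override def getResource(name: String): URL = {
          val url = super.getResource(name)
          if (url != null) url else realParent.getResource(name)
        }
      }
      ```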
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:
      
      9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
      fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
      a8c69f1 [Marcelo Vanzin] Review feedback.
      cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
      0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaningful.
      fe970a7 [Marcelo Vanzin] Review feedback.
      25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
      fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
      2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
      b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a10f379 [Marcelo Vanzin] Some feedback.
      3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      7b57cba [Marcelo Vanzin] Remove now outdated message.
      5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
      fa1aafa [Marcelo Vanzin] Remove write check on user jars.
      89d8072 [Marcelo Vanzin] Cleanups.
      a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
      50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
      7d14397 [Marcelo Vanzin] Register user jars in executor up front.
      7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
      20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
      55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
      0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
      4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
      d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
      46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
      a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
      91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
      a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
      89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
      20a60131