  1. May 08, 2015
      [SPARK-6869] [PYSPARK] Add pyspark archives path to PYTHONPATH · ebff7327
      Lianhui Wang authored
      Based on https://github.com/apache/spark/pull/5478, which provides a PYSPARK_ARCHIVES_PATH env variable. With this PR, we only need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when PySpark is not installed on each node of YARN. I ran a Python application successfully in both yarn-client and yarn-cluster mode with this PR.
      andrewor14 sryza Sephiroth-Lin Can you take a look at this? Thanks.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:
      
      66ffa43 [Lianhui Wang] Update Client.scala
      c2ad0f9 [Lianhui Wang] Update Client.scala
      1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      20402cd [Lianhui Wang] use ZipEntry
      9d87c3f [Lianhui Wang] update scala style
      e7bd971 [Lianhui Wang] address vanzin's comments
      4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
      e6b573b [Lianhui Wang] address vanzin's comments
      f11f84a [Lianhui Wang] zip pyspark archives
      5192cca [Lianhui Wang] update import path
      3b1e4c8 [Lianhui Wang] address tgravescs's comments
      9396346 [Lianhui Wang] put zip to make-distribution.sh
      0d2baf7 [Lianhui Wang] update import paths
      e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
      31e8e06 [Lianhui Wang] update code style
      9f31dac [Lianhui Wang] update code and add comments
      f72987c [Lianhui Wang] add archives path to PYTHONPATH
  2. May 07, 2015
      [SPARK-6908] [SQL] Use isolated Hive client · cd1d4110
      Michael Armbrust authored
      This PR switches Spark SQL's Hive support to use the isolated Hive client interface introduced by #5851, instead of directly interacting with the client.  By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version`, without the need to recompile.  This also greatly reduces the surface area of our interaction with the Hive libraries, hopefully making it easier to support other versions in the future.
      
      Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
       - a colon-separated list of jar files or directories for Hive and Hadoop.
       - `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.
       - `maven` - download the correct version of Hive on demand from Maven.
      
      By default, `builtin` is used for Hive 13.
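
      A minimal sketch of how these options are set, assuming a plain `SparkConf`-based setup (only the two `spark.sql.hive.metastore.*` keys come from this PR; the surrounding app code is illustrative):

      ```
      import org.apache.spark.{SparkConf, SparkContext}

      // Point Spark SQL at a specific Hive metastore version without recompiling.
      val conf = new SparkConf()
        .setAppName("isolated-hive-client-demo")
        .set("spark.sql.hive.metastore.version", "0.13.1")
        // "maven" fetches matching Hive jars on demand; alternatives are "builtin"
        // or a colon-separated list of jars/directories, as listed above.
        .set("spark.sql.hive.metastore.jars", "maven")
      val sc = new SparkContext(conf)
      ```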
      
      This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores.  However, the full removal of the Shim is deferred until a later PR.
      
      Remaining TODOs:
       - Remove the Hive Shims and inline code for Hive 13.
       - Several HiveCompatibility tests are not yet passing.
        - `nullformatCTAS` - As detailed below, we are now handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer.  However, we currently only handle the common cases, and not things like CTAS where the null format is specified.
        - `combine1` now leaks state about compression somehow, breaking all subsequent tests.  As such, we currently add it to the blacklist.
        - `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore.  We are correctly propagating the information
        - "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests.  It seems our `RESET` mechanism may not be as robust as it used to be?
      
      Other required changes:
       -  `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline.  Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`.  The full parsing here is not yet complete as detailed above in the remaining TODOs.  Since the operator is Hive specific, it is moved to the hive package.
       - `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5876 from marmbrus/useIsolatedClient and squashes the following commits:
      
      258d000 [Michael Armbrust] really really correct path handling
      e56fd4a [Michael Armbrust] getAbsolutePath
      5a259f5 [Michael Armbrust] fix typos
      81bb366 [Michael Armbrust] comments from vanzin
      5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      4b5cd41 [Michael Armbrust] yin's comments
      f5de7de [Michael Armbrust] cleanup
      11e9c72 [Michael Armbrust] better coverage in versions suite
      7e8f010 [Michael Armbrust] better error messages and jar handling
      e7b3941 [Michael Armbrust] more permissive checking for function registration
      da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      5fe5894 [Michael Armbrust] fix serialization suite
      81711c4 [Michael Armbrust] Initial support for running without maven
      1d8ae44 [Michael Armbrust] fix final tests?
      1c50813 [Michael Armbrust] more comments
      a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      a6f5df1 [Michael Armbrust] style
      ab07f7e [Michael Armbrust] WIP
      4d8bf02 [Michael Armbrust] Remove hive 12 compilation
      8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
  3. Apr 30, 2015
      [Build] Enable MiMa checks for SQL · fa01bec4
      Josh Rosen authored
      Now that 1.3 has been released, we should enable MiMa checks for the `sql` subproject.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5727 from JoshRosen/enable-more-mima-checks and squashes the following commits:
      
      3ad302b [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
      0c48e4d [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
      e276cee [Josh Rosen] Fix SQL MiMa checks via excludes and private[sql]
      44d0d01 [Josh Rosen] Add back 'launcher' exclude
      1aae027 [Josh Rosen] Enable MiMa checks for launcher and sql projects.
      [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add... · 07a86205
      Josh Rosen authored
      [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add facade in front of Unsafe; remove use of Unsafe.setMemory
      
      This patch suppresses compiler warnings due to our use of `sun.misc.Unsafe` (introduced in #5725).  These warnings can only be suppressed via the `-XDignore.symbol.file` javac flag; the `SuppressWarnings` annotation won't work for these.
      
      In order to restrict uses of this compiler flag to the `unsafe` module, I placed a facade in front of `Unsafe` so that other modules won't call it directly. This facade will also help us to avoid accidental usage of deprecated Unsafe methods or methods that aren't supported in Java 6.
      
      I also removed an unnecessary use of `Unsafe.setMemory`, which isn't present in certain versions of Java 6, and excluded the new `unsafe` module from Javadoc.
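
      A rough illustration of the facade idea (the object name `PlatformMemory` and its method set are invented here, not Spark's actual class):

      ```
      // Confine direct sun.misc.Unsafe access to one object, so that only the
      // module containing it needs the -XDignore.symbol.file javac flag.
      object PlatformMemory {
        private val unsafe: sun.misc.Unsafe = {
          // Unsafe's constructor is private; grab the singleton reflectively.
          val f = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
          f.setAccessible(true)
          f.get(null).asInstanceOf[sun.misc.Unsafe]
        }
        // Expose only the narrow, non-deprecated subset callers actually need.
        def getLong(obj: AnyRef, offset: Long): Long = unsafe.getLong(obj, offset)
        def putLong(obj: AnyRef, offset: Long, value: Long): Unit =
          unsafe.putLong(obj, offset, value)
      }
      ```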
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5814 from JoshRosen/unsafe-compiler-warnings-fixes and squashes the following commits:
      
      9e8c483 [Josh Rosen] Exclude new unsafe module from Javadoc
      ba75ecf [Josh Rosen] Only apply -XDignore.symbol.file flag in unsafe project.
      7403345 [Josh Rosen] Put facade in front of Unsafe.
      50230c0 [Josh Rosen] Remove usage of Unsafe.setMemory
      96d41c9 [Josh Rosen] Use -XDignore.symbol.file to suppress warnings about sun.misc.Unsafe usage
      [SPARK-7207] [ML] [BUILD] Added ml.recommendation, ml.regression to SparkBuild · adbdb19a
      Joseph K. Bradley authored
      Added ml.recommendation, ml.regression to SparkBuild
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5758 from jkbradley/SPARK-7207 and squashes the following commits:
      
      a28158a [Joseph K. Bradley] Added ml.recommendation, ml.regression to SparkBuild
  4. Apr 29, 2015
      [SPARK-7076][SPARK-7077][SPARK-7080][SQL] Use managed memory for aggregations · f49284b5
      Josh Rosen authored
      This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory.  This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses.  As a result, this allows for higher memory utilization.  It can also result in better cache locality since objects will be stored closer together in memory.
      
      This feature can be enabled by setting `spark.sql.unsafe.enabled=true`.  For now, this feature is only supported when codegen is enabled, and it only supports aggregations for which the grouping columns are primitive numeric types or strings and the aggregated values are numeric.
      
      ### Managing memory with sun.misc.Unsafe
      
      This patch supports both on- and off-heap managed memory.
      
      - In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
      - In off-heap mode, memory is addressed directly with 64-bit long addresses.
      
      To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields.  In off-heap mode, we simply pass `null` as `baseObject`.
      
      We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks.
      
      By default, we use on-heap mode.  To enable off-heap mode, set `spark.unsafe.offHeap=true`.
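
      A toy demonstration of the dual addressing convention using `sun.misc.Unsafe` directly (none of this is Spark's actual code):

      ```
      import sun.misc.Unsafe

      // Unsafe must be obtained reflectively outside the JDK itself.
      val f = classOf[Unsafe].getDeclaredField("theUnsafe")
      f.setAccessible(true)
      val unsafe = f.get(null).asInstanceOf[Unsafe]

      // On-heap mode: an address is a JVM object plus an offset within it.
      val arr = new Array[Long](4)
      val base = unsafe.arrayBaseOffset(classOf[Array[Long]]).toLong
      unsafe.putLong(arr, base, 42L)            // writes arr(0)

      // Off-heap mode: baseObject is null; the offset is an absolute 64-bit address.
      val addr = unsafe.allocateMemory(32)
      unsafe.putLong(null, addr, 42L)
      assert(unsafe.getLong(null, addr) == 42L)
      unsafe.freeMemory(addr)
      ```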
      
      To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`.  These classes work together to track allocations and detect memory leaks.
      
      ### Compact tuple format
      
      This patch introduces `UnsafeRow`, a compact row layout.  In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)
      
      - Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
      - The bit set is used for null tracking:
      	- Position _i_ is set if and only if field _i_ is null
      	- The bit set is aligned to an 8-byte word boundary.
      - Every field appears as an 8-byte word in the fixed-length values part:
      	- If a field is null, we zero out the values.
      	- If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
      - Each variable-length data type can have its own encoding:
      	- For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes.  If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.
      
      For example, a tuple that consists of 3 fields of type (int, string, string), with value (null, “data”, “bricks”), would look like this:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)
      
      This format allows us to compare tuples for equality by directly comparing their raw bytes.  This also enables fast hashing of tuples.
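
      Working through the example row above under these layout rules (pure arithmetic from the description; not Spark code):

      ```
      // Size of the (null, "data", "bricks") row described above.
      def roundUp8(n: Int): Int = (n + 7) / 8 * 8
      // A string costs one 8-byte length word plus its UTF-8 bytes, padded to a word.
      def stringBytes(s: String): Int = 8 + roundUp8(s.getBytes("UTF-8").length)

      val bitsetBytes = 8        // 3 fields fit into a single 64-bit bitset word
      val fixedBytes  = 3 * 8    // one 8-byte slot per field
      val varBytes    = stringBytes("data") + stringBytes("bricks")  // 16 + 16
      println(bitsetBytes + fixedBytes + varBytes)  // 8 + 24 + 32 = 64, a multiple of 8
      ```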
      
      ### Hash map for performing aggregations
      
      This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-width.  This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.
      
      `BytesToBytesMap` stores pointers to key and value tuples.  For each record with a new key, we copy the key, create the aggregation value buffer for that key, and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.
      
      This map is implemented using open hashing with triangular sequence probing.  Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode.  By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions (since the chance of a hashcode collision is much lower than a position collision).
      
      `UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map.  Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`.  This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.
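
      A standalone sketch of triangular sequence probing (the toy function is invented; the slot encoding itself is as described above):

      ```
      // Probe offsets grow by 1 each step, visiting hash, hash+1, hash+3, hash+6, ...
      // (triangular numbers). With a power-of-two capacity, this sequence visits
      // every slot exactly once, so the probe never cycles early.
      def probeSequence(hash: Int, capacity: Int): Iterator[Int] = {
        require((capacity & (capacity - 1)) == 0, "capacity must be a power of two")
        Iterator.iterate((hash & (capacity - 1), 1)) { case (pos, step) =>
          ((pos + step) & (capacity - 1), step + 1)
        }.map(_._1)
      }

      // probeSequence(42, 8).take(8).toSet.size == 8: no slot is skipped.
      ```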
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5725 from JoshRosen/unsafe and squashes the following commits:
      
      eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns.
      81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate
      1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing.
      017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize()
      50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication
      70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager:
      6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods.
      de5e001 [Josh Rosen] Fix debug vs. trace in logging message.
      a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention.
      78a5b84 [Josh Rosen] Add logging to MemoryManager
      ce3c565 [Josh Rosen] More comments, formatting, and code cleanup.
      529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds.
      3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace
      162caf7 [Josh Rosen] Fix test compilation
      b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size.
      a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv.
      0925847 [Josh Rosen] Disable MiMa checks for new unsafe module
      cde4132 [Josh Rosen] Add missing pom.xml
      9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap
      6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow.
      31eaabc [Josh Rosen] Lots of TODO and doc cleanup.
      a95291e [Josh Rosen] Cleanups to string handling code
      afe8dca [Josh Rosen] Some Javadoc cleanup
      f3dcbfe [Josh Rosen] More mod replacement
      854201a [Josh Rosen] Import and comment cleanup
      06e929d [Josh Rosen] More warning cleanup
      ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections
      29a7575 [Josh Rosen] Remove debug logging
      49aed30 [Josh Rosen] More long -> int conversion.
      b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation.
      765243d [Josh Rosen] Enable optional performance metrics for hash map.
      23a440a [Josh Rosen] Bump up default hash map size
      628f936 [Josh Rosen] Use ints instead of longs for indexing.
      92d5a06 [Josh Rosen] Address a number of minor code review comments.
      1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests.
      d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap
      bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...)
      b3eaccd [Josh Rosen] Extract aggregation map into its own class.
      d2bb986 [Josh Rosen] Update to implement new Row methods added upstream
      58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable)
      7df6008 [Josh Rosen] Optimizations related to zeroing out memory:
      c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free():
      738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate
      c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed.
      62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns.
      c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter
      ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort"
      c754ae1 [Josh Rosen] Now that the store*() contract has been strengthened, we can remove an extra lookup
      f764d13 [Josh Rosen] Simplify address + length calculation in Location.
      079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract.
      1a483c5 [Josh Rosen] First version that passes some aggregation tests:
      fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate
      53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters
      1ff814d [Josh Rosen] Add reminder to free memory on iterator completion
      8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration.
      5d55cef [Josh Rosen] Add skeleton for Row implementation.
      f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods.
      ab68e08 [Josh Rosen] Begin merging the UTF8String implementations.
      480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
  5. Apr 28, 2015
      [SPARK-6756] [MLLIB] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector · 5ef006fc
      Xiangrui Meng authored
      Add `compressed` to `Vector` with some other methods: `numActives`, `numNonzeros`, `toSparse`, and `toDense`. jkbradley
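
      A hedged usage sketch of the new methods (the return values noted in comments follow from the descriptions above):

      ```
      import org.apache.spark.mllib.linalg.Vectors

      val v = Vectors.dense(1.0, 0.0, 3.0)
      v.numActives            // 3: entries explicitly stored
      v.numNonzeros           // 2: entries that are actually non-zero
      val sv = v.toSparse     // sparse copy keeping only indices 0 and 2
      val dv = sv.toDense     // back to a dense representation
      val c  = v.compressed   // whichever of dense/sparse needs less storage
      ```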
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5756 from mengxr/SPARK-6756 and squashes the following commits:
      
      8d4ecbd [Xiangrui Meng] address comment and add mima excludes
      da54179 [Xiangrui Meng] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector
  6. Apr 27, 2015
      [SPARK-7090] [MLLIB] Introduce LDAOptimizer to LDA to further improve extensibility · 4d9e560b
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7090
      
      LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs sampling we are collecting more detailed requirements from different algorithms.
      As Joseph Bradley jkbradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
      Basically, the LDA class is the common entry point for LDA computation, and each LDA object refers to an LDAOptimizer for the concrete algorithm implementation. Users can customize an LDAOptimizer with specific parameters and assign it to the LDA.
      
      Concrete changes:
      
      1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.
      
      2. Move EMOptimizer to the LDAOptimizer file, make it inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes along in the future).
              - Adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
              - Move the code from LDA.initialState to the initialState of EMLDAOptimizer.
      
      3. Add a property ldaOptimizer to LDA along with its getter/setter; EMLDAOptimizer is the default optimizer.
      
      4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
      
      Further work:
      add OnlineLDAOptimizer and other possible optimizers once they are ready.
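
      A hedged sketch of the resulting API shape (class names come from the description above; the setter name is an assumption based on point 3):

      ```
      import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA}

      // LDA is the common entry point; the concrete algorithm is delegated to a
      // pluggable LDAOptimizer, with EMLDAOptimizer as the default.
      val lda = new LDA()
        .setK(10)
        .setOptimizer(new EMLDAOptimizer)  // the default, shown explicitly
      // lda.run(corpus) now returns the general LDAModel type (point 4 above),
      // where corpus is an RDD[(Long, Vector)] of document IDs and term counts.
      ```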
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:
      
      0e2e006 [Yuhao Yang] respond to review comments
      08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      e756ce4 [Yuhao Yang] solve mima exception
      d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      0bb8400 [Yuhao Yang] refactor LDA with Optimizer
      ec2f857 [Yuhao Yang] prototype for discussion
  7. Apr 17, 2015
      [SPARK-6703][Core] Provide a way to discover existing SparkContext's · c5ed5101
      Ilya Ganelin authored
      I've added a getOrCreate method to the SparkContext companion object that allows one either to retrieve a previously created SparkContext or to instantiate a new one with the provided config. The method accepts an optional SparkConf to make usage intuitive.
      
      Still working on a test for this; basically, I want to create a new context from scratch, then ensure that subsequent calls don't overwrite it.
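
      A minimal sketch of the intended usage (the master and app name are illustrative):

      ```
      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf().setAppName("demo").setMaster("local[2]")
      val sc1 = SparkContext.getOrCreate(conf)  // instantiates a new context
      val sc2 = SparkContext.getOrCreate()      // retrieves the one created above
      assert(sc1 eq sc2)                        // subsequent calls don't overwrite it
      ```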
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5501 from ilganeli/SPARK-6703 and squashes the following commits:
      
      db9a963 [Ilya Ganelin] Closing second spark context
      1dc0444 [Ilya Ganelin] Added ref equality check
      8c884fa [Ilya Ganelin] Made getOrCreate synchronized
      cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
      270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
      15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
      0e1567c [Ilya Ganelin] Got rid of unnecessary option for AtomicReference
      dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
      733ec9f [Ilya Ganelin] Fixed some bugs in test code
      8be2f83 [Ilya Ganelin] Replaced match with if
      e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
      a99032f [Ilya Ganelin] Spacing fix
      d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
  8. Apr 14, 2015
      [SPARK-5808] [build] Package pyspark files in sbt assembly. · 65774370
      Marcelo Vanzin authored
      This turned out to be more complicated than I wanted because the
      layout of python/ doesn't really follow the usual Maven conventions.
      So some extra code is needed to copy just the right things.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5461 from vanzin/SPARK-5808 and squashes the following commits:
      
      7153dac [Marcelo Vanzin] Only try to create resource dir if it doesn't already exist.
      ee90e84 [Marcelo Vanzin] [SPARK-5808] [build] Package pyspark files in sbt assembly.
  9. Apr 11, 2015
      [hotfix] [build] Make sure JAVA_HOME is set for tests. · 694aef0d
      Marcelo Vanzin authored
      This is needed at least for YARN integration tests, since `$JAVA_HOME` is used to launch the executors.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5441 from vanzin/yarn-test-test and squashes the following commits:
      
      3eeec30 [Marcelo Vanzin] Use JAVA_HOME when available, java.home otherwise.
      d71f1bb [Marcelo Vanzin] And sbt too.
      6bda399 [Marcelo Vanzin] WIP: Testing to see whether this fixes the yarn test issue.
  10. Apr 09, 2015
      [Spark-6693][MLlib]add tostring with max lines and width for matrix · 9c67049b
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-6693
      
      It's kind of annoying when debugging to find that you cannot print out a matrix the way you want.

      The original toString of Matrix only prints like the following:
      ```
      0.17810102596909183    0.5616906241468385    ... (10 total)
      0.9692861997823815     0.015558159784155756  ...
      0.8513015122819192     0.031523763918528847  ...
      0.5396875653953941     0.3267864552779176    ...
      ```

      The `def toString(maxLines: Int, maxWidth: Int)` is useful when debugging, logging, and saving matrices to files.
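
      A minimal usage sketch (the 2x2 matrix and the bounds are arbitrary):

      ```
      import org.apache.spark.mllib.linalg.Matrices

      val m = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))  // column-major values
      println(m.toString(5, 40))  // print at most 5 lines, at most 40 chars per line
      ```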
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5344 from hhbyyh/addToString and squashes the following commits:
      
      19a6836 [Yuhao Yang] remove extra line
      6314b21 [Yuhao Yang] add exclude
      736c324 [Yuhao Yang] add ut and exclude
      420da39 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into addToString
      c22f352 [Yuhao Yang] style change
      64a9e0f [Yuhao Yang] add specific to string to matrix
  11. Apr 07, 2015
      [SPARK-6750] Upgrade ScalaStyle to 0.7. · 12322159
      Reynold Xin authored
      0.7 contains a pretty useful bug fix: inline functions no longer require an explicit return type definition.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5399 from rxin/style0.7 and squashes the following commits:
      
      54c41b2 [Reynold Xin] Actually update the version.
      09c759c [Reynold Xin] [SPARK-6750] Upgrade ScalaStyle to 0.7.
  12. Apr 03, 2015
      [SPARK-6492][CORE] SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies · 2c43ea38
      Ilya Ganelin authored
      I've added a timeout and retry loop around the SparkContext shutdown code that should fix this deadlock. If a SparkContext shutdown is in progress when another thread comes knocking, it will wait 10 seconds for the lock and then fall through to the outer loop, which re-submits the request.
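
      The shape of that timeout-and-retry loop, as a standalone sketch (not the actual SparkContext code):

      ```
      import java.util.concurrent.TimeUnit
      import java.util.concurrent.locks.ReentrantLock

      val shutdownLock = new ReentrantLock()

      def stopWithRetry(doStop: () => Unit): Unit = {
        var done = false
        while (!done) {  // the outer loop re-submits the request
          if (shutdownLock.tryLock(10, TimeUnit.SECONDS)) {
            try { doStop(); done = true }
            finally { shutdownLock.unlock() }
          }
          // else: waited 10 seconds for the lock; fall through and retry
        }
      }
      ```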
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5277 from ilganeli/SPARK-6492 and squashes the following commits:
      
      8617a7e [Ilya Ganelin] Resolved merge conflict
      2fbab66 [Ilya Ganelin] Added MIMA Exclude
      a0e2c70 [Ilya Ganelin] Deleted stale imports
      fa28ce7 [Ilya Ganelin] reverted to just having a single stopped
      76fc825 [Ilya Ganelin] Updated to use atomic booleans instead of the synchronized vars
      6e8a7f7 [Ilya Ganelin] Removing unnecessary null check for now since I'm not fixing stop ordering yet
      cdf7073 [Ilya Ganelin] [SPARK-6492] Moved stopped=true back to the start of the shutdown sequence so this can be addressed in a separate PR
      7fb795b [Ilya Ganelin] Spacing
      b7a0c5c [Ilya Ganelin] Import ordering
      df8224f [Ilya Ganelin] Added comment for added lock
      343cb94 [Ilya Ganelin] [SPARK-6492] Added timeout/retry logic to fix a deadlock in SparkContext shutdown
  13. Apr 01, 2015
      [SPARK-4655][Core] Split Stage into ShuffleMapStage and ResultStage subclasses · ff1915e1
      Ilya Ganelin authored
      Hi all - this patch changes the Stage class to an abstract class and introduces two new classes that extend it: ShuffleMapStage and ResultStage - with the goal of increasing readability of the DAGScheduler class. Their usage is updated within DAGScheduler.
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      Author: Ilya Ganelin <ilganeli@gmail.com>
      
      Closes #4708 from ilganeli/SPARK-4655 and squashes the following commits:
      
      c248924 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      d930385 [Ilya Ganelin] Fixed merge conflict from
      a9a765f [Ilya Ganelin] Update DAGScheduler.scala
      c03563c [Ilya Ganelin] Minor fixeS
      c39e971 [Ilya Ganelin] Added return typing for public methods
      845bc87 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      e8031d8 [Ilya Ganelin] Minor string fixes
      4ec53ac [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      c004f62 [Ilya Ganelin] Update DAGScheduler.scala
      a2cb03f [Ilya Ganelin] [SPARK-4655] Replaced usages of Nil and eliminated some code reuse
      3d5cf20 [Ilya Ganelin] [SPARK-4655] Moved mima exclude to 1.4
      6912c55 [Ilya Ganelin] Resolved merge conflict
      4bff208 [Ilya Ganelin] Minor stylistic fixes
      c6fffbb [Ilya Ganelin] newline
      41402ad [Ilya Ganelin] Style fixes
      02c6981 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      c755a09 [Ilya Ganelin] Some more stylistic updates and minor refactoring
      b6257a0 [Ilya Ganelin] Update MimaExcludes.scala
      0f0c624 [Ilya Ganelin] Fixed merge conflict
      2eba262 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      6b43d7b [Ilya Ganelin] Got rid of some spaces
      6f1a5db [Ilya Ganelin] Revert "More minor formatting and refactoring"
      1b3471b [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      c9288e2 [Ilya Ganelin] More minor formatting and refactoring
      d548caf [Ilya Ganelin] Formatting fix
      c3ae5c2 [Ilya Ganelin] Explicit typing
      0dacaf3 [Ilya Ganelin] Got rid of stale import
      6da3a71 [Ilya Ganelin] Trailing whitespace
      b85c5fe [Ilya Ganelin] Added minor fixes
      a57dfcd [Ilya Ganelin] Added MiMA exclusion to get around binary compatibility check
      83ed849 [Ilya Ganelin] moved braces for consistency
      96dd161 [Ilya Ganelin] Fixed minor style error
      cfd6f10 [Ilya Ganelin] Updated DAGScheduler to use new ResultStage and ShuffleMapStage classes
      83494e9 [Ilya Ganelin] Added new Stage classes
  14. Mar 30, 2015
      [SPARK-6592][SQL] fix filter for scaladoc to generate API doc for Row class under catalyst dir · 32259c67
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-6592
      
      The current implementation in SparkBuild.scala filters out all classes under the catalyst directory; however, we have a corner case: the Row class is a public API under that directory.

      We need to include Row in the scaladoc while still excluding the other classes of the catalyst project.
      
      Thanks for the help on this patch from rxin and liancheng
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #5252 from CodingCat/SPARK-6592 and squashes the following commits:
      
      02098a4 [CodingCat] ignore collection, enable types (except those protected classes)
      f7af2cb [CodingCat] commit
      3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory
  15. Mar 29, 2015
      [SPARK-5124][Core] A standard RPC interface and an Akka implementation · a8d53afb
      zsxwing authored
      This PR added a standard internal RPC interface for Spark and an Akka implementation. See [the design document](https://issues.apache.org/jira/secure/attachment/12698710/Pluggable%20RPC%20-%20draft%202.pdf) for more details.
      
      I will split the whole work into multiple PRs to make code review easier. This is the first PR, and it avoids touching too many files.
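
      A hedged sketch of what an endpoint-style RPC interface of this kind looks like (the names RpcEndpoint, RpcCallContext, receiveAndReply, and sendFailure appear in the commit list below; the signatures here are inferred, not Spark's actual ones):

      ```
      trait RpcCallContext {
        def reply(response: Any): Unit       // answer an ask-style message
        def sendFailure(e: Throwable): Unit  // propagate an error to the caller
      }

      trait RpcEndpoint {
        // Fire-and-forget messages (send).
        def receive: PartialFunction[Any, Unit]
        // Request/response messages (ask); replies go through the context.
        def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
      }
      ```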
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4588 from zsxwing/rpc-part1 and squashes the following commits:
      
      fe3df4c [zsxwing] Move registerEndpoint and use actorSystem.dispatcher in asyncSetupEndpointRefByURI
      f6f3287 [zsxwing] Remove RpcEndpointRef.toURI
      8bd1097 [zsxwing] Fix docs and the code style
      f459380 [zsxwing] Add RpcAddress.fromURI and rename urls to uris
      b221398 [zsxwing] Move send methods above ask methods
      15cfd7b [zsxwing] Merge branch 'master' into rpc-part1
      9ffa997 [zsxwing] Fix MiMa tests
      78a1733 [zsxwing] Merge remote-tracking branch 'origin/master' into rpc-part1
      385b9c3 [zsxwing] Fix the code style and add docs
      2cc3f78 [zsxwing] Add an asynchronous version of setupEndpointRefByUrl
      e8dfec3 [zsxwing] Remove 'sendWithReply(message: Any, sender: RpcEndpointRef): Unit'
      08564ae [zsxwing] Add RpcEnvFactory to create RpcEnv
      e5df4ca [zsxwing] Handle AkkaFailure(e) in Actor
      ec7c5b0 [zsxwing] Fix docs
      7fc95e1 [zsxwing] Implement askWithReply in RpcEndpointRef
      9288406 [zsxwing] Document thread-safety for setupThreadSafeEndpoint
      3007c09 [zsxwing] Move setupDriverEndpointRef to RpcUtils and rename to makeDriverRef
      c425022 [zsxwing] Fix the code style
      5f87700 [zsxwing] Move the logical of processing message to a private function
      3e56123 [zsxwing] Use lazy to eliminate CountDownLatch
      07f128f [zsxwing] Remove ActionScheduler.scala
      4d34191 [zsxwing] Remove scheduler from RpcEnv
      7cdd95e [zsxwing] Add docs for RpcEnv
      51e6667 [zsxwing] Add 'sender' to RpcCallContext and rename the parameter of receiveAndReply to 'context'
      ffc1280 [zsxwing] Rename 'fail' to 'sendFailure' and other minor code style changes
      28e6d0f [zsxwing] Add onXXX for network events and remove the companion objects of network events
      3751c97 [zsxwing] Rename RpcResponse to RpcCallContext
      fe7d1ff [zsxwing] Add explicit reply in rpc
      7b9e0c9 [zsxwing] Fix the indentation
      04a106e [zsxwing] Remove NopCancellable and add a const NOP in object SettableCancellable
      2a579f4 [zsxwing] Remove RpcEnv.systemName
      155b987 [zsxwing] Change newURI to uriOf and add some comments
      45b2317 [zsxwing] A standard RPC interface and An Akka implementation
  16. Mar 26, 2015
      [SPARK-6510][GraphX]: Add Graph#minus method to act as Set#difference · 39fb5796
      Brennon York authored
      Adds a `Graph#minus` method which will return only the unique `VertexId`s from the calling `VertexRDD`.
      
      To demonstrate a basic example with pseudocode:
      
      ```
      Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
      > Set((0L,0))
      ```
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5175 from brennonyork/SPARK-6510 and squashes the following commits:
      
      248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
      3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
      6575d92 [Brennon York] updated mima exclude
      aaa030b [Brennon York] completed graph#minus functionality
      7227c0f [Brennon York] beginning work on minus functionality
  17. Mar 24, 2015
      [SPARK-6428] Added explicit types for all public methods in core. · 4ce2782a
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5125 from rxin/core-explicit-type and squashes the following commits:
      
      f471415 [Reynold Xin] Revert style checker changes.
      81b66e4 [Reynold Xin] Code review feedback.
      a7533e3 [Reynold Xin] Mima excludes.
      1d795f5 [Reynold Xin] [SPARK-6428] Added explicit types for all public methods in core.
  18. Mar 20, 2015
      [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT. · a7456459
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:
      
      63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
      6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
      178ba71 [Marcelo Vanzin] Oops.
      75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
      a45a62c [Marcelo Vanzin] Work around MIMA warning.
      1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
      0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
      cef4603 [Marcelo Vanzin] Indentation.
      296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
  19. Mar 16, 2015
      [SPARK-5922][GraphX]: Add diff(other: RDD[VertexId, VD]) in VertexRDD · 45f4c661
      Brennon York authored
      Changed the parameter type of 'diff' from VertexRDD[VD] to RDD[(VertexId, VD)] to match that of 'innerJoin' and 'leftJoin'. This change maintains backwards compatibility and better unifies the VertexRDD methods with each other.
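
      In signature terms, a sketch of the two forms (the original overload is kept for binary compatibility, per the commits below):

      ```
      import org.apache.spark.graphx.{VertexId, VertexRDD}
      import org.apache.spark.rdd.RDD

      trait DiffSignatures[VD] {
        def diff(other: VertexRDD[VD]): VertexRDD[VD]        // original form
        def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD]  // new form, matching innerJoin/leftJoin
      }
      ```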
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #4733 from brennonyork/SPARK-5922 and squashes the following commits:
      
      e800f08 [Brennon York] fixed merge conflicts
      b9274af [Brennon York] fixed merge conflicts
      f86375c [Brennon York] fixed minor include line
      398ddb4 [Brennon York] fixed merge conflicts
      aac1810 [Brennon York] updated to aggregateUsingIndex and added test to ensure that method works properly
      2af0b88 [Brennon York] removed deprecation line
      753c963 [Brennon York] fixed merge conflicts and set preference to use the diff(other: VertexRDD[VD]) method
      2c678c6 [Brennon York] added mima exclude to exclude new public diff method from VertexRDD
      93186f3 [Brennon York] added back the original diff method to sustain binary compatibility
      f18356e [Brennon York] changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]
  20. Mar 15, 2015
      [SPARK-6285][SQL]Remove ParquetTestData in SparkBuild.scala and in README.md · 62ede538
      OopsOutOfMemory authored
      This is a follow-up cleanup PR for #5010.
      It resolves errors like the following when launching `hive/console`:
      ```
      <console>:20: error: object ParquetTestData is not a member of package org.apache.spark.sql.parquet
             import org.apache.spark.sql.parquet.ParquetTestData
      ```
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #5032 from OopsOutOfMemory/SPARK-6285 and squashes the following commits:
      
      2996aeb [OopsOutOfMemory] remove ParquetTestData
  21. Mar 13, 2015
      [SPARK-6317][SQL]Fixed HIVE console startup issue · e360d5e4
      vinodkc authored
      Author: vinodkc <vinod.kc.in@gmail.com>
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #5011 from vinodkc/HIVE_console_startupError and squashes the following commits:
      
      b43925f [vinodkc] Changed order of import
      b4f5453 [Vinod K C] Fixed HIVE console startup issue
  22. Mar 12, 2015
      [SPARK-4588] ML Attributes · a4b27162
      Xiangrui Meng authored
      This continues the work in #4460 from srowen. The design doc is published on the JIRA page with some minor changes.
      
      Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24
      
      More details can be found in the design doc.
      
      srowen Could you help review this PR? There are many lines but most of them are boilerplate code.
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4925 from mengxr/SPARK-4588-new and squashes the following commits:
      
      71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
      617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
      393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
      b1aceef [Xiangrui Meng] more tests
      e7ab467 [Xiangrui Meng] update ML attribute impl
      7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
      2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
      [SPARK-5814][MLLIB][GRAPHX] Remove JBLAS from runtime · 0cba802a
      Xiangrui Meng authored
      The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage with netlib-java gives us a simpler dependency tree and fewer license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified, to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4699 from mengxr/SPARK-5814 and squashes the following commits:
      
      48635c6 [Xiangrui Meng] move netlib-java version to parent pom
      ca21c74 [Xiangrui Meng] remove jblas from ml-guide
      5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
      c5c4183 [Xiangrui Meng] merge master
      0f20cad [Xiangrui Meng] add mima excludes
      e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
      ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
      fa7c2ca [Xiangrui Meng] move jblas to test scope
  23. Mar 11, 2015
      [SPARK-4924] Add a library for launching Spark jobs programmatically. · 517975d8
      Marcelo Vanzin authored
      This change encapsulates all the logic involved in launching a Spark job
      into a small Java library that can be easily embedded into other applications.
      
      The overall goal of this change is twofold, as described in the bug:
      
      - Provide a public API for launching Spark processes. This is a common request
        from users and currently there's no good answer for it.
      
      - Remove a lot of the duplicated code and other coupling that exists in the
        different parts of Spark that deal with launching processes.
      
      A lot of the duplication was due to different code needed to build an
      application's classpath (and the bootstrapper needed to run the driver in
      certain situations), and also different code needed to parse spark-submit
      command line options in different contexts. The change centralizes those
      as much as possible so that all code paths can rely on the library for
      handling those appropriately.
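
      A hedged usage sketch of the library (SparkLauncher is named in the commits below; the jar path and main class are placeholders):

      ```
      import org.apache.spark.launcher.SparkLauncher

      // Launch a Spark application from plain JVM code instead of shelling out by hand.
      val proc = new SparkLauncher()
        .setAppResource("/path/to/app.jar")  // placeholder jar
        .setMainClass("com.example.Main")    // placeholder main class
        .setMaster("local[*]")
        .launch()                            // returns a java.lang.Process
      proc.waitFor()
      ```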
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3916 from vanzin/SPARK-4924 and squashes the following commits:
      
      18c7e4d [Marcelo Vanzin] Fix make-distribution.sh.
      2ce741f [Marcelo Vanzin] Add lots of quotes.
      3b28a75 [Marcelo Vanzin] Update new pom.
      a1b8af1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      897141f [Marcelo Vanzin] Review feedback.
      e2367d2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      28cd35e [Marcelo Vanzin] Remove stale comment.
      b1d86b0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      00505f9 [Marcelo Vanzin] Add blurb about new API in the programming guide.
      5f4ddcc [Marcelo Vanzin] Better usage messages.
      92a9cfb [Marcelo Vanzin] Fix Win32 launcher, usage.
      6184c07 [Marcelo Vanzin] Rename field.
      4c19196 [Marcelo Vanzin] Update comment.
      7e66c18 [Marcelo Vanzin] Fix pyspark tests.
      0031a8e [Marcelo Vanzin] Review feedback.
      c12d84b [Marcelo Vanzin] Review feedback. And fix spark-submit on Windows.
      e2d4d71 [Marcelo Vanzin] Simplify some code used to launch pyspark.
      43008a7 [Marcelo Vanzin] Don't make builder extend SparkLauncher.
      b4d6912 [Marcelo Vanzin] Use spark-submit script in SparkLauncher.
      28b1434 [Marcelo Vanzin] Add a comment.
      304333a [Marcelo Vanzin] Fix propagation of properties file arg.
      bb67b93 [Marcelo Vanzin] Remove unrelated Yarn change (that is also wrong).
      8ec0243 [Marcelo Vanzin] Add missing newline.
      95ddfa8 [Marcelo Vanzin] Fix handling of --help for spark-class command builder.
      72da7ec [Marcelo Vanzin] Rename SparkClassLauncher.
      62978e4 [Marcelo Vanzin] Minor cleanup of Windows code path.
      9cd5b44 [Marcelo Vanzin] Make all non-public APIs package-private.
      e4c80b6 [Marcelo Vanzin] Reorganize the code so that only SparkLauncher is public.
      e50dc5e [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      de81da2 [Marcelo Vanzin] Fix CommandUtils.
      86a87bf [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      2061967 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      46d46da [Marcelo Vanzin] Clean up a test and make it more future-proof.
      b93692a [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      ad03c48 [Marcelo Vanzin] Revert "Fix a thread-safety issue in "local" mode."
      0b509d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      23aa2a9 [Marcelo Vanzin] Read java-opts from conf dir, not spark home.
      7cff919 [Marcelo Vanzin] Javadoc updates.
      eae4d8e [Marcelo Vanzin] Fix new unit tests on Windows.
      e570fb5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      44cd5f7 [Marcelo Vanzin] Add package-info.java, clean up javadocs.
      f7cacff [Marcelo Vanzin] Remove "launch Spark in new thread" feature.
      7ed8859 [Marcelo Vanzin] Some more feedback.
      54cd4fd [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      61919df [Marcelo Vanzin] Clean leftover debug statement.
      aae5897 [Marcelo Vanzin] Use launcher classes instead of jars in non-release mode.
      e584fc3 [Marcelo Vanzin] Rework command building a little bit.
      525ef5b [Marcelo Vanzin] Rework Unix spark-class to handle argument with newlines.
      8ac4e92 [Marcelo Vanzin] Minor test cleanup.
      e946a99 [Marcelo Vanzin] Merge PySparkLauncher into SparkSubmitCliLauncher.
      c617539 [Marcelo Vanzin] Review feedback round 1.
      fc6a3e2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      f26556b [Marcelo Vanzin] Fix a thread-safety issue in "local" mode.
      2f4e8b4 [Marcelo Vanzin] Changes needed to make this work with SPARK-4048.
      799fc20 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      bb5d324 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      53faef1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      a7936ef [Marcelo Vanzin] Fix pyspark tests.
      656374e [Marcelo Vanzin] Mima fixes.
      4d511e7 [Marcelo Vanzin] Fix tools search code.
      7a01e4a [Marcelo Vanzin] Fix pyspark on Yarn.
      1b3f6e9 [Marcelo Vanzin] Call SparkSubmit from spark-class launcher for unknown classes.
      25c5ae6 [Marcelo Vanzin] Centralize SparkSubmit command line parsing.
      27be98a [Marcelo Vanzin] Modify Spark to use launcher lib.
      6f70eea [Marcelo Vanzin] [SPARK-4924] Add a library for launching Spark jobs programmatically.
  24. Mar 03, 2015
      [SPARK-5310][SQL] Fixes to Docs and Datasources API · 54d19689
      Reynold Xin authored
       - Various fixes to docs
       - Make data source traits actually interfaces
      
      Based on #4862 but with fixed conflicts.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4868 from marmbrus/pr/4862 and squashes the following commits:
      
      fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862
      0208497 [Reynold Xin] Test fixes.
      34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs.
  25. Feb 19, 2015
      SPARK-4682 [CORE] Consolidate various 'Clock' classes · 34b7c353
      Sean Owen authored
      Another one from JoshRosen's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, and is necessary for consolidating the streaming one. I put together the implementations in the way that seemed simplest. Almost all the change is standardizing class and method names.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4514 from srowen/SPARK-4682 and squashes the following commits:
      
      5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
      169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
      277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
      b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
      160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
      7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
  26. Feb 17, 2015
      [SPARK-5166][SPARK-5247][SPARK-5258][SQL] API Cleanup / Documentation · c74b07fa
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4642 from marmbrus/docs and squashes the following commits:
      
      d291c34 [Michael Armbrust] python tests
      9be66e3 [Michael Armbrust] comments
      d56afc2 [Michael Armbrust] fix style
      f004747 [Michael Armbrust] fix build
      c4a907b [Michael Armbrust] fix tests
      42e2b73 [Michael Armbrust] [SQL] Documentation / API Clean-up.
  27. Feb 09, 2015
      [SPARK-2996] Implement userClassPathFirst for driver, yarn. · 20a60131
      Marcelo Vanzin authored
      Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
      `spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
      modifies the system classpath, instead of restricting the changes to the user's class
      loader. So this change implements the behavior of the latter for Yarn, and deprecates
      the more dangerous choice.
      
      To be able to achieve feature-parity, I also implemented the option for drivers (the existing
      option only applies to executors). So now there are two options, each controlling whether
      to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
      aliased to the new one (`spark.executor.userClassPathFirst`).
      
      The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
      was also doing some things that ended up causing JVM errors depending on how things
      were being called.
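
      A self-contained sketch of the child-first pattern in question (illustrative, not Spark's implementation):

      ```
      import java.net.{URL, URLClassLoader}

      // Child-first: look in the user's jars before delegating to the parent,
      // with per-class locking to avoid the deadlocks mentioned in the commits below.
      class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
          extends URLClassLoader(urls, parent) {
        override def loadClass(name: String, resolve: Boolean): Class[_] =
          getClassLoadingLock(name).synchronized {
            val loaded = Option(findLoadedClass(name)).getOrElse {
              try findClass(name)  // user classes first
              catch {
                case _: ClassNotFoundException => super.loadClass(name, resolve)
              }
            }
            if (resolve) resolveClass(loaded)
            loaded
          }
      }
      ```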
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:
      
      9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
      fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
      a8c69f1 [Marcelo Vanzin] Review feedback.
      cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
      0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaningful.
      fe970a7 [Marcelo Vanzin] Review feedback.
      25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
      fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
      2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
      b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a10f379 [Marcelo Vanzin] Some feedback.
      3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      7b57cba [Marcelo Vanzin] Remove now outdated message.
      5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
      fa1aafa [Marcelo Vanzin] Remove write check on user jars.
      89d8072 [Marcelo Vanzin] Cleanups.
      a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
      50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
      7d14397 [Marcelo Vanzin] Register user jars in executor up front.
      7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
      20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
      55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
      0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
      4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
      d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
      46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
      a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
      91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
      a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
      89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
  28. Feb 07, 2015
      [BUILD] Add the ability to launch spark-shell from SBT. · e9a4fe12
      Michael Armbrust authored
      Now you can quickly launch the spark-shell without building an assembly.  For quick development iteration, run `build/sbt ~sparkShell`; calling exit will relaunch the shell with any changes.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4438 from marmbrus/sparkShellSbt and squashes the following commits:
      
      b4e44fe [Michael Armbrust] [BUILD] Add the ability to launch spark-shell from SBT.
  29. Feb 06, 2015
      [SQL][HiveConsole][DOC] HiveConsole `correct hiveconsole imports` · b62c3524
      OopsOutOfMemory authored
      Sorry that PR #4330 had some mistakes.

      I've corrected them, so it works correctly now.
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #4389 from OopsOutOfMemory/doc and squashes the following commits:
      
      843eed9 [OopsOutOfMemory] correct hiveconsole imports
      [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs · dc0c4490
      Joseph K. Bradley authored
      This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
      
      **UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion.  Here is a list of changes which are public:
      * new output columns: rawPrediction, probabilities
        * The “score” column is now called “rawPrediction”
      * Classifiers now provide numClasses
      * Params.get and .set are now protected instead of private[ml].
      * ParamMap now has a size method.
      * new classes: LinearRegression, LinearRegressionModel
      * LogisticRegression now has an intercept.
      
      ### Sketch of APIs (most of which are private[spark] for now)
      
      Abstract classes for learning algorithms (+ corresponding Model abstractions):
      * Classifier (+ ClassificationModel)
      * ProbabilisticClassifier (+ ProbabilisticClassificationModel)
      * Regressor (+ RegressionModel)
      * Predictor (+ PredictionModel)
      * *For all of these*:
       * There is no strongly typed training-time API.
       * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.
      
      Concrete classes: learning algorithms
      * LinearRegression
      * LogisticRegression (updated to use new abstract classes)
       * Also, removed "score" in favor of "probability" output column.  Changed BinaryClassificationEvaluator to match. (SPARK-5031)
      
      Other updates:
      * params.scala: Changed Params.set/get to be protected instead of private[ml]
       * This was needed for the example of defining a class from outside of the MLlib namespace.
      * VectorUDT: Will later change from private[spark] to public.
       * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
 * Also, added equals() method.
* SPARK-4942: ML Transformers should allow output cols to be turned on/off
       * Update validateAndTransformSchema
       * Update transform
      * (Updated examples, test suites according to other changes)
      
      New examples:
      * DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
       * Added Java version too
      
      Test Suites:
      * LinearRegressionSuite
      * LogisticRegressionSuite
      * + Java versions of above suites
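
To make the shape of the new API concrete, here is a minimal sketch (assuming DataFrames `training` and `test` with "label" and "features" columns; setter and column names follow the conventions above and may differ slightly in this snapshot):

```scala
import org.apache.spark.ml.classification.LogisticRegression

// training/test: DataFrames with "label" (Double) and "features" (Vector) columns
val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(training)

// transform() appends the new output columns described above:
// rawPrediction, probability, and prediction
model.transform(test)
  .select("features", "rawPrediction", "probability", "prediction")
  .show()
```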
      
      CC: mengxr  etrain  shivaram
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:
      
      405bfb8 [Joseph K. Bradley] Last edits based on code review.  Small cleanups
      fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
      8316d5e [Joseph K. Bradley] fixes after rebasing on master
      fc62406 [Joseph K. Bradley] fixed test suites after last commit
      bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
      9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
      f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
      216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
      f549e34 [Joseph K. Bradley] Updates based on code review.  Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT.   * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
      343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
      82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR).  Fixed Java suites
      0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
      c3c8da5 [Joseph K. Bradley] small cleanup
      934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
      1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel.  Also introduced ProbabilisticClassifier.  * This was to support output column “probabilityCol” in transform().
      4e2f711 [Joseph K. Bradley] rat fix
      bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
      8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
      1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
      adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
      58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
      57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
      e433872 [Joseph K. Bradley] Updated docs.  Added LabeledPointSuite to spark.ml
      54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
      0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString).  Fixed bug in persisting logreg data.  Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
      601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString.  Cleaned up classes in class hierarchy, before implementing tests and examples.
      d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
      52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
      d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
      bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
      dc0c4490
  30. Feb 05, 2015
    • Xiangrui Meng's avatar
      [SPARK-5620][DOC] group methods in generated unidoc · 85ccee81
      Xiangrui Meng authored
      It seems that `(ScalaUnidoc, unidoc)` is the correct way to overwrite `scalacOptions` in unidoc.
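
The change amounts to scoping the option under the unidoc task in `SparkBuild.scala`, roughly like this (a sketch of the idea, not the exact diff):

```scala
// group methods in the generated scaladoc by scoping -groups
// to the unidoc task of the ScalaUnidoc configuration
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq("-groups")
```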
      
      CC: rxin gzm0
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4404 from mengxr/SPARK-5620 and squashes the following commits:
      
      f890cf5 [Xiangrui Meng] add -groups to scalacOptions in unidoc
      85ccee81
  31. Feb 04, 2015
    • OopsOutOfMemory's avatar
      [SQL][Hiveconsole] Bring hive console code up to date and update README.md · b73d5fff
      OopsOutOfMemory authored
Add `import org.apache.spark.sql.Dsl._` to make DSL queries work.
Since queryExecution is not available in DataFrame, remove it.
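
With that import in place, a DSL query in the hive console looks roughly like this (a hedged sketch; `sql` is predefined in the console, and the sample table is hypothetical):

```scala
import org.apache.spark.sql.Dsl._

// the Dsl._ implicits enable symbol-based column references like 'key
val filtered = sql("SELECT key, value FROM src").where('key > 10)
filtered.collect().foreach(println)
```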
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
      
      Closes #4330 from OopsOutOfMemory/hiveconsole and squashes the following commits:
      
      46eb790 [Sheng, Li] Update SparkBuild.scala
      d23ee9f [OopsOutOfMemory] minor
      d4dd593 [OopsOutOfMemory] refine hive console
      b73d5fff
  32. Feb 03, 2015
    • Xiangrui Meng's avatar
      [SPARK-5536] replace old ALS implementation by the new one · 0cc7b88c
      Xiangrui Meng authored
The only issue is that `analyzeBlocks`, which was marked as a developer API, is removed. I didn't change the other tests in the ALSSuite under `spark.mllib`, so that they verify the new implementation is correct.
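
The public `spark.mllib` entry points are unchanged by the swap; a minimal usage sketch:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// toy ratings; the new implementation now runs under the hood
val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
val model = ALS.train(ratings, rank = 10, iterations = 5)
model.predict(2, 2)  // predicted rating for user 2, product 2
```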
      
      CC: srowen coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4321 from mengxr/SPARK-5536 and squashes the following commits:
      
      5a3cee8 [Xiangrui Meng] update python tests that are too strict
      e840acf [Xiangrui Meng] ignore scala style check for ALS.train
      e9a721c [Xiangrui Meng] update mima excludes
      9ee6a36 [Xiangrui Meng] merge master
      9a8aeac [Xiangrui Meng] update tests
      d8c3271 [Xiangrui Meng] remove analyzeBlocks
      d68eee7 [Xiangrui Meng] add checkpoint to new ALS
      22a56f8 [Xiangrui Meng] wrap old ALS
      c387dff [Xiangrui Meng] support random seed
      3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
      0cc7b88c
  33. Feb 02, 2015
    • Davies Liu's avatar
      [SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python · 0561c454
      Davies Liu authored
      This PR brings the Python API for Spark Streaming Kafka data source.
      
      ```
          class KafkaUtils(__builtin__.object)
           |  Static methods defined here:
           |
     |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False, 2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
           |      Create an input stream that pulls messages from a Kafka Broker.
           |
           |      :param ssc:  StreamingContext object
           |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
           |      :param groupId:  The group id for this consumer.
           |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
           |                      Each partition is consumed in its own thread.
           |      :param storageLevel:  RDD storage level.
           |      :param keyDecoder:  A function used to decode key
           |      :param valueDecoder:  A function used to decode value
           |      :return: A DStream object
      ```
      run the example:
      
      ```
      bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
      ```
      
      Author: Davies Liu <davies@databricks.com>
      Author: Tathagata Das <tdas@databricks.com>
      
      Closes #3715 from davies/kafka and squashes the following commits:
      
      d93bfe0 [Davies Liu] Update make-distribution.sh
      4280d04 [Davies Liu] address comments
      e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      f257071 [Davies Liu] add tests for null in RDD
      23b039a [Davies Liu] address comments
      9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
      a74da87 [Davies Liu] address comments
      dc1eed0 [Davies Liu] Update kafka_wordcount.py
      31e2317 [Davies Liu] Update kafka_wordcount.py
      370ba61 [Davies Liu] Update kafka.py
      97386b3 [Davies Liu] address comment
      2c567a5 [Davies Liu] update logging and comment
      33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
      aea8953 [Tathagata Das] Kafka-assembly for Python API
      eea16a7 [Davies Liu] refactor
      f6ce899 [Davies Liu] add example and fix bugs
      98c8d17 [Davies Liu] fix python style
      5697a01 [Davies Liu] bypass decoder in scala
      048dbe6 [Davies Liu] fix python style
      75d485e [Davies Liu] add mqtt
      07923c4 [Davies Liu] support kafka in Python
      0561c454
    • Xiangrui Meng's avatar
      [SPARK-5540] hide ALS.solveLeastSquares · ef65cf09
      Xiangrui Meng authored
This method survived code review and has been there since v1.1.0. It exposes jblas types, so let's remove it from the public API. I don't think anyone calls it directly.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4318 from mengxr/SPARK-5540 and squashes the following commits:
      
      586ade6 [Xiangrui Meng] hide ALS.solveLeastSquares
      ef65cf09
    • Joseph K. Bradley's avatar
[SPARK-5461] [graphx] Add isCheckpointed, getCheckpointFiles methods to Graph · 842d0003
      Joseph K. Bradley authored
      Added the 2 methods to Graph and GraphImpl.  Both make calls to the underlying vertex and edge RDDs.
      
      This is needed for another PR (for LDA): [https://github.com/apache/spark/pull/4047]
      
      Notes:
* getCheckpointFiles is plural and returns a Seq[String] instead of an Option[String].
* I attempted to test that the methods return the correct values after checkpointing. It did not work; I guess checkpointing does not occur quickly enough? I noticed that there are no checkpointing tests for RDDs; is it just hard to test well?
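
For context, a rough sketch of the two methods in use (hedged; the input path and checkpoint directory are hypothetical, and the method name follows the final commit below):

```scala
import org.apache.spark.graphx.GraphLoader

sc.setCheckpointDir("/tmp/checkpoints")                 // hypothetical directory
val graph = GraphLoader.edgeListFile(sc, "edges.txt")   // hypothetical input
graph.checkpoint()
graph.vertices.count()   // materialize so the checkpoint actually happens
graph.isCheckpointed     // Boolean: true once the underlying RDDs are checkpointed
graph.getCheckpointFiles // Seq[String]: one entry per checkpointed RDD
```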
      
      CC: rxin
      
      CC: mengxr  (since related to LDA)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4253 from jkbradley/graphx-checkpoint and squashes the following commits:
      
      b680148 [Joseph K. Bradley] added class tag to firstParent call in VertexRDDImpl.isCheckpointed, though not needed to compile
250810e [Joseph K. Bradley] In EdgeRDDImpl, VertexRDDImpl, added transient back to partitionsRDD, and made isCheckpointed check firstParent instead of partitionsRDD
      695b7a3 [Joseph K. Bradley] changed partitionsRDD in EdgeRDDImpl, VertexRDDImpl to be non-transient
      cc00767 [Joseph K. Bradley] added overrides for isCheckpointed, getCheckpointFile in EdgeRDDImpl, VertexRDDImpl. The corresponding Graph methods now work.
      188665f [Joseph K. Bradley] improved documentation
      235738c [Joseph K. Bradley] Added isCheckpointed and getCheckpointFiles to Graph, GraphImpl
      842d0003
  34. Jan 28, 2015
    • Xiangrui Meng's avatar
      [SPARK-5430] move treeReduce and treeAggregate from mllib to core · 4ee79c71
      Xiangrui Meng authored
      We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. pwendell
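
For readers outside MLlib, the pattern on a plain RDD looks like this (a minimal sketch):

```scala
val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

// treeAggregate combines partitions in log-depth stages, easing the
// pressure on the driver relative to a flat aggregate()
val sum = rdd.treeAggregate(0L)(
  seqOp = (acc, x) => acc + x,
  combOp = (a, b) => a + b,
  depth = 2)
```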
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4228 from mengxr/SPARK-5430 and squashes the following commits:
      
      20ad40d [Xiangrui Meng] exclude tree* from mima
      e89a43e [Xiangrui Meng] fix compile and update java doc
      3ae1a4b [Xiangrui Meng] add treeReduce/treeAggregate to Python
      6f948c5 [Xiangrui Meng] add treeReduce/treeAggregate to JavaRDDLike
      d600b6c [Xiangrui Meng] move treeReduce and treeAggregate to core
      4ee79c71