  1. Jul 13, 2015
    • Sun Rui's avatar
      [SPARK-6797] [SPARKR] Add support for YARN cluster mode. · 7f487c8b
      Sun Rui authored
      This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node.
      
      This PR uses the JDK jar tool to package the SparkR package, because jar is expected to be available on both Linux and Windows wherever a JDK is installed.
      
      This PR does not address the R workers involved in the RDD API; that will be addressed in a separate JIRA issue.
      
      This PR does not address the SBT build. SparkR installation and packaging under SBT will be addressed in a separate JIRA issue.
      
      R/install-dev.bat is not tested. shivaram, could you help test it?
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits:
      
      ca63c86 [Sun Rui] Adjust MimaExcludes after rebase.
      7313374 [Sun Rui] Fix unit test errors.
      72695fb [Sun Rui] Fix unit test failures.
      193882f [Sun Rui] Fix Mima test error.
      fe25a33 [Sun Rui] Fix Mima test error.
      35ecfa3 [Sun Rui] Fix comments.
      c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes.
      b05340c [Sun Rui] Fix scala style.
      2ca5048 [Sun Rui] Fix comments.
      1acefd1 [Sun Rui] Fix scala style.
      0aa1e97 [Sun Rui] Fix scala style.
      41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs.
      49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat.
      7b916c5 [Sun Rui] Use 'rem' consistently.
      3bed438 [Sun Rui] Add a comment.
      681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes.
      cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.
      7f487c8b
  2. Jul 10, 2015
    • Jonathan Alter's avatar
      [SPARK-7977] [BUILD] Disallowing println · e14b545d
      Jonathan Alter authored
      Author: Jonathan Alter <jonalter@users.noreply.github.com>
      
      Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:
      
      ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
      7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
      10724b6 [Jonathan Alter] Changing some printlns to logs in tests
      eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0b1dcb4 [Jonathan Alter] More println cleanup
      aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0c16fa3 [Jonathan Alter] Replacing some printlns with logs
      45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      5c8e283 [Jonathan Alter] Allowing println in audit-release examples
      5b50da1 [Jonathan Alter] Allowing printlns in example files
      ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      83ab635 [Jonathan Alter] Fixing new printlns
      54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
      b837c3a [Jonathan Alter] Disallowing println
      e14b545d
  3. Jul 09, 2015
    • zsxwing's avatar
      [SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch page · 1f6b0b12
      zsxwing authored
      This PR adds `metadata` to `InputInfo`. `InputDStream` can report its metadata for a batch and it will be shown in the batch page.
      
      For example,
      
      ![screen shot](https://cloud.githubusercontent.com/assets/1000778/8403741/d6ffc7e2-1e79-11e5-9888-c78c1575123a.png)
      
      FileInputDStream will display the new files for a batch, and DirectKafkaInputDStream will display its offset ranges.
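
      For instance, a custom input DStream could report per-batch metadata roughly as in the sketch below. This is a hedged illustration based on the description above; `StreamInputInfo` and the description key follow the PR text, but the exact internal API (and how the info is handed to the tracker) may differ.

      ```scala
      import org.apache.spark.streaming.scheduler.StreamInputInfo

      // Build the per-batch input info with free-form metadata; the description entry
      // is what the batch page renders (names follow the PR text, details may vary).
      def buildInputInfo(numRecords: Long, offsetRanges: String): StreamInputInfo = {
        val metadata: Map[String, Any] = Map(
          StreamInputInfo.METADATA_KEY_DESCRIPTION -> s"Offset ranges:\n$offsetRanges")
        StreamInputInfo(0, numRecords, metadata)  // 0 is an illustrative input stream id
      }
      // The DStream would then report this info for each batch time via the internal InputInfoTracker.
      ```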
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7081 from zsxwing/input-metadata and squashes the following commits:
      
      f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
      d906209 [zsxwing] Merge branch 'master' into input-metadata
      74762da [zsxwing] Fix MiMa tests
      7903e33 [zsxwing] Merge branch 'master' into input-metadata
      450a46c [zsxwing] Address comments
      1d94582 [zsxwing] Raname InputInfo to StreamInputInfo and change "metadata" to Map[String, Any]
      d496ae9 [zsxwing] Add input metadata in the batch page
      1f6b0b12
  4. Jul 08, 2015
    • Davies Liu's avatar
      [SPARK-8450] [SQL] [PYSPARK] cleanup type converter for Python DataFrame · 74d8d3d9
      Davies Liu authored
      This PR fixes the converter for Python DataFrame, especially for DecimalType
      
      Closes #7106
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7131 from davies/decimal_python and squashes the following commits:
      
      4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7d73168 [Davies Liu] fix conflit
      6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7104e97 [Davies Liu] improve type infer
      9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
      829a05b [Davies Liu] fix UDT in python
      c99e8c5 [Davies Liu] fix mima
      c46814a [Davies Liu] convert decimal for Python DataFrames
      74d8d3d9
    • Kousuke Saruta's avatar
      [SPARK-8914][SQL] Remove RDDApi · 2a4f88b6
      Kousuke Saruta authored
      As rxin suggested in #7298, we should consider removing `RDDApi`.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7302 from sarutak/remove-rddapi and squashes the following commits:
      
      e495d35 [Kousuke Saruta] Fixed mima
      cb7ebb9 [Kousuke Saruta] Removed overriding RDDApi
      2a4f88b6
    • Cheng Lian's avatar
      [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for... · 4ffc27ca
      Cheng Lian authored
      [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility
      
      This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support.  And this one fixes the read path.  Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
      
      ### Major changes
      
      1. `CatalystConverter` class hierarchy refactoring
      
         - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
      
            Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. The root converter doesn't have a parent and thus uses a `NoopUpdater`.
      
            This simplifies the design since converters don't need to care about the details of their parent converters anymore (a short sketch of this updater pattern follows this list).
      
         - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
      
           Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
      
         - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
      
           `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
      
           The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
      
         - Implements backwards-compatibility rules in `CatalystArrayConverter`
      
            When Parquet records are being converted, the schema of the Parquet files should already have been verified, so we only need to care about the structure rather than the field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
      
      2. Requested columns handling
      
         When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns.  This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a different physical structure from the converted schema.
      
         In this PR, the schema for requested columns is constructed using the following method:
      
         - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
         - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
         - Finally, we union all single-field `MessageType`s into a full schema containing all requested fields.
      
         With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-file.
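
      To make the updater idea above concrete, here is a heavily simplified sketch. Only `ParentContainerUpdater` and `NoopUpdater` are named in this description; everything else is illustrative and not the actual Spark classes.

      ```scala
      import scala.collection.mutable.ArrayBuffer

      // Each converter pushes its result to its parent through this small callback,
      // so it never needs to know the parent converter's concrete type.
      trait ParentContainerUpdater {
        def set(value: Any): Unit
      }

      // The root converter has no parent, so its updates go nowhere.
      object NoopUpdater extends ParentContainerUpdater {
        override def set(value: Any): Unit = ()
      }

      // An element converter for a LIST appends converted values to the parent's buffer.
      class ArrayElementUpdater(buffer: ArrayBuffer[Any]) extends ParentContainerUpdater {
        override def set(value: Any): Unit = buffer += value
      }
      ```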
      
      ### Testing
      
      This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.
      
      [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
      [2]: https://issues.apache.org/jira/browse/SPARK-6774
      [3]: https://issues.apache.org/jira/browse/SPARK-6123
      [4]: https://issues.apache.org/jira/browse/SPARK-8848
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7231 from liancheng/spark-6776 and squashes the following commits:
      
      360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
      c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
      b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
      598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
      926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
      7946ee1 [Cheng Lian] Fixes Scala styling issues
      3d7ab36 [Cheng Lian] Fixes .rat-excludes
      a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
      f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
      1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
      440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
      13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
      06cfe9d [Cheng Lian] Adds comments about TimestampType handling
      a099d3e [Cheng Lian] More comments
      0cc1b37 [Cheng Lian] Fixes MiMa checks
      884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
      802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
      38fe1e7 [Cheng Lian] Adds explicit return type
      7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
      1781dff [Cheng Lian] Adds test case for SPARK-8811
      6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
      bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
      a74fb2c [Cheng Lian] More comments
      0525346 [Cheng Lian] Removes old Parquet record converters
      03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
      4ffc27ca
    • DB Tsai's avatar
      [SPARK-8700][ML] Disable feature scaling in Logistic Regression · 57221934
      DB Tsai authored
      All compressed sensing applications, and some of the regression use cases, will get better results by turning feature scaling off. However, if we implement this naively by training on the dataset without any standardization, the rate of convergence will not be good. Instead, we can still standardize the training dataset but penalize each component differently, which yields effectively the same objective function but a better-conditioned numerical problem. As a result, columns with high variances will be penalized less, and vice versa. Without this, since all the features are standardized, they would all be penalized the same.
      
      In R, there is an option for this.
      `standardize`
      Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
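
      A hedged usage sketch, assuming the new flag surfaces as a `standardization` param on `LogisticRegression` (as it does in later Spark ML releases):

      ```scala
      import org.apache.spark.ml.classification.LogisticRegression

      // Keep features on their original scale, analogous to glmnet's standardize=FALSE;
      // internally the solver may still standardize and adjust the penalty per column.
      val lr = new LogisticRegression()
        .setStandardization(false)
        .setRegParam(0.1)
        .setMaxIter(100)
      // val model = lr.fit(training)  // `training` is a hypothetical DataFrame of (label, features)
      ```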
      
      +cc holdenk mengxr jkbradley
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #7080 from dbtsai/lors and squashes the following commits:
      
      877e6c7 [DB Tsai] repahse the doc
      7cf45f2 [DB Tsai] address feedback
      78d75c9 [DB Tsai] small change
      c2c9e60 [DB Tsai] style
      6e1a8e0 [DB Tsai] first commit
      57221934
  5. Jul 03, 2015
  6. Jul 02, 2015
    • MechCoder's avatar
      [SPARK-8479] [MLLIB] Add numNonzeros and numActives to linalg.Matrices · 34d448db
      MechCoder authored
      Matrices allow zeros to be stored in their values array. It is sometimes handy to have methods that check whether the number of nonzeros is the same as the number of active (explicitly stored) values.
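
      A small hedged example of the distinction (the method names follow this description; the constructor shown is the standard CSC factory):

      ```scala
      import org.apache.spark.mllib.linalg.Matrices

      // A 2x2 sparse matrix storing two entries, one of which is an explicit zero.
      val sm = Matrices.sparse(2, 2, Array(0, 1, 2), Array(0, 1), Array(3.0, 0.0))
      // numActives counts explicitly stored entries; numNonzeros skips stored zeros.
      assert(sm.numActives == 2)
      assert(sm.numNonzeros == 1)
      ```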
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6904 from MechCoder/nnz_matrix and squashes the following commits:
      
      252c6b7 [MechCoder] Add to MiMa excludes
      e2390f5 [MechCoder] Use count instead of foreach
      2f62b2f [MechCoder] Add to MiMa excludes
      d6e96ef [MechCoder] [SPARK-8479] Add numNonzeros and numActives to linalg.Matrices
      34d448db
  7. Jul 01, 2015
    • jerryshao's avatar
      [SPARK-7820] [BUILD] Fix Java8-tests suite compile and test error under sbt · 9f7db348
      jerryshao authored
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #7120 from jerryshao/SPARK-7820 and squashes the following commits:
      
      6902439 [jerryshao] fix Java8-tests suite compile error under sbt
      9f7db348
    • zsxwing's avatar
      [SPARK-8378] [STREAMING] Add the Python API for Flume · 75b9fe4c
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6830 from zsxwing/flume-python and squashes the following commits:
      
      78dfdac [zsxwing] Fix the compile error in the test code
      f1bf3c0 [zsxwing] Address TD's comments
      0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
      e93736b [zsxwing] Fix the test case for determine_modules_to_test
      9d5821e [zsxwing] Fix pyspark_core dependencies
      f9ee681 [zsxwing] Merge branch 'master' into flume-python
      7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
      b96b0de [zsxwing] Merge branch 'master' into flume-python
      ce85e83 [zsxwing] Fix incompatible issues for Python 3
      01cbb3d [zsxwing] Add import sys
      152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
      14ba0ff [zsxwing] Add flume-assembly for sbt building
      b8d5551 [zsxwing] Merge branch 'master' into flume-python
      4762c34 [zsxwing] Fix the doc
      0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
      9f33873 [zsxwing] Add the Python API for Flume
      75b9fe4c
  8. Jun 24, 2015
    • Cheng Lian's avatar
      [SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter · 8ab50765
      Cheng Lian authored
      This PR introduces `CatalystSchemaConverter` for converting Parquet schema to Spark SQL schema and vice versa.  Original conversion code in `ParquetTypesConverter` is removed. Benefits of the new version are:
      
      1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most updated Parquet format spec] [1]. Converting to old style Parquet schemas is also supported via feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both read and write paths are fixed).
      
         Note that although this version of the Parquet format spec hasn't been officially released yet, Parquet MR 1.7.0 already sticks to it. So it should be safe to follow.
      
      1. It implements the backwards-compatibility rules described in the most up-to-date Parquet format spec, and can thus recognize more schema patterns generated by other/legacy systems/tools.
      1. Code organization follows the convention used in [parquet-mr] [2], which is easier to follow. (The structure of `CatalystSchemaConverter` is similar to `AvroSchemaConverter`.)
      
      To fully implement backwards-compatibility rules in both read and write path, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These would be done in follow-up PRs.
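
      A hedged usage sketch of the feature flag named above (the key comes from this description; how it is consumed internally may differ):

      ```scala
      // `sqlContext` is an existing SQLContext. Opt in to the standard
      // Parquet schema conversion once both read and write paths are fixed.
      sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")
      ```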
      
      TODO
      
      - [x] More schema conversion test cases for legacy schema patterns.
      
      [1]: https://github.com/apache/parquet-format/blob/ea095226597fdbecd60c2419d96b54b2fdb4ae6c/LogicalTypes.md
      [2]: https://github.com/apache/parquet-mr/
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6617 from liancheng/spark-6777 and squashes the following commits:
      
      2a2062d [Cheng Lian] Don't convert decimals without precision information
      b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
      743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
      a104a9e [Cheng Lian] Fixes Scala style issue
      1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
      ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
      13cb8d5 [Cheng Lian] Fixes MiMa failure
      81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
      28ef95b [Cheng Lian] More AnalysisExceptions
      b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
      cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
      8ab50765
    • Josh Rosen's avatar
      [HOTFIX] [BUILD] Fix MiMa checks in master branch; enable MiMa for launcher project · 13ae806b
      Josh Rosen authored
      This commit changes the MiMa tests to test against the released 1.4.0 artifacts rather than 1.4.0-rc4; this change is necessary to fix a Jenkins build break since it seems that the RC4 snapshot is no longer available via Maven.
      
      I also enabled MiMa checks for the `launcher` subproject, which we should have done right after 1.4.0 was released.
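
      The change amounts to something like the following sketch (the file and value names are assumptions for illustration, not the exact diff):

      ```scala
      // project/MimaBuild.scala (illustrative)
      // Compare against the released 1.4.0 artifacts, since the 1.4.0-rc4
      // snapshot is no longer resolvable from Maven.
      val previousSparkVersion = "1.4.0"  // was "1.4.0-rc4"
      ```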
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6974 from JoshRosen/mima-hotfix and squashes the following commits:
      
      4b4175a [Josh Rosen] [HOTFIX] [BUILD] Fix MiMa checks in master branch; enable MiMa for launcher project
      13ae806b
  9. Jun 23, 2015
    • Holden Karau's avatar
      [SPARK-7888] Be able to disable intercept in linear regression in ml package · 2b1111dd
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6927 from holdenk/SPARK-7888-Be-able-to-disable-intercept-in-Linear-Regression-in-ML-package and squashes the following commits:
      
      0ad384c [Holden Karau] Add MiMa excludes
      4016fac [Holden Karau] Switch to wild card import, remove extra blank lines
      ae5baa8 [Holden Karau] CR feedback, move the fitIntercept down rather than changing ymean and etc above
      f34971c [Holden Karau] Fix some more long lines
      319bd3f [Holden Karau] Fix long lines
      3bb9ee1 [Holden Karau] Update the regression suite tests
      7015b9f [Holden Karau] Our code performs the same with R, except we need more than one data point but that seems reasonable
      0b0c8c0 [Holden Karau] fix the issue with the sample R code
      e2140ba [Holden Karau] Add a test, it fails!
      5e84a0b [Holden Karau] Write out thoughts and use the correct trait
      91ffc0a [Holden Karau] more murh
      006246c [Holden Karau] murp?
      2b1111dd
  10. Jun 22, 2015
    • Davies Liu's avatar
      [SPARK-8307] [SQL] improve timestamp from parquet · 6b7f2cea
      Davies Liu authored
      This PR changes the code to convert Julian days to Unix timestamps directly (without Calendar and Timestamp).
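
      The direct conversion is essentially the following arithmetic (a hedged sketch; the constants are standard, the helper name is illustrative):

      ```scala
      // Julian Day Number of the Unix epoch (1970-01-01).
      val JULIAN_DAY_OF_EPOCH = 2440588L
      val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

      // Parquet INT96 timestamps store a Julian day plus nanoseconds within the day;
      // convert them straight to microseconds since the epoch, skipping Calendar/Timestamp.
      def julianDayToMicros(julianDay: Int, nanosOfDay: Long): Long =
        (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000
      ```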
      
      cc adrian-wang rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6759 from davies/improve_ts and squashes the following commits:
      
      849e301 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      b0e4cad [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      8e2d56f [Davies Liu] address comments
      634b9f5 [Davies Liu] fix mima
      4891efb [Davies Liu] address comment
      bfc437c [Davies Liu] fix build
      ae5979c [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      602b969 [Davies Liu] remove jodd
      2f2e48c [Davies Liu] fix test
      8ace611 [Davies Liu] fix mima
      212143b [Davies Liu] fix mina
      c834108 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      a3171b8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      5233974 [Davies Liu] fix scala style
      361fd62 [Davies Liu] address comments
      ea196d4 [Davies Liu] improve timestamp from parquet
      6b7f2cea
  11. Jun 19, 2015
    • cody koeninger's avatar
      [SPARK-8127] [STREAMING] [KAFKA] KafkaRDD optimize count() take() isEmpty() · 1b6fe9b1
      cody koeninger authored
      Take advantage of offset range info for size-related KafkaRDD methods.  Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.
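
      The idea, sketched with hedged, illustrative names (the real classes live in the Kafka streaming module and differ in detail):

      ```scala
      // An offset range already knows how many messages it covers, so size-related
      // RDD methods can be answered without touching Kafka at all.
      case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
        def count: Long = untilOffset - fromOffset
      }

      def rddCount(ranges: Seq[OffsetRange]): Long = ranges.map(_.count).sum
      def rddIsEmpty(ranges: Seq[OffsetRange]): Boolean = ranges.forall(_.count == 0L)
      ```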
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #6632 from koeninger/kafka-rdd-count and squashes the following commits:
      
      321340d [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of ordering of take()
      5a05d0f [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of isEmpty
      f68bd32 [cody koeninger] [Streaming][Kafka][SPARK-8127] code cleanup
      9555b73 [cody koeninger] Merge branch 'master' into kafka-rdd-count
      253031d [cody koeninger] [Streaming][Kafka][SPARK-8127] mima exclusion for change to private method
      8974b9e [cody koeninger] [Streaming][Kafka][SPARK-8127] check offset ranges before constructing KafkaRDD
      c3768c5 [cody koeninger] [Streaming][Kafka] Take advantage of offset range info for size-related KafkaRDD methods.  Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.
      1b6fe9b1
  12. Jun 17, 2015
  13. Jun 16, 2015
    • Marcelo Vanzin's avatar
      [SPARK-8126] [BUILD] Make sure temp dir exists when running tests. · cebf2411
      Marcelo Vanzin authored
      If you ran "clean" at the top-level sbt project, the temp dir would
      go away, so running "test" without restarting sbt would fail. This
      fixes that by making sure the temp dir exists before running tests.
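
      A hedged sbt sketch of the kind of fix described (the actual wiring in SparkBuild.scala may differ):

      ```scala
      // Create the shared temp directory before forked test JVMs start, so a prior
      // top-level "clean" does not break "test" in the same sbt session.
      testOptions in Test += Tests.Setup { () =>
        val tmp = new java.io.File("target/tmp")  // illustrative path
        if (!tmp.isDirectory) tmp.mkdirs()
      }
      ```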
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6805 from vanzin/SPARK-8126-fix and squashes the following commits:
      
      12d7768 [Marcelo Vanzin] [SPARK-8126] [build] Make sure temp dir exists when running tests.
      cebf2411
  14. Jun 11, 2015
    • Reynold Xin's avatar
      [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package. · 7d669a56
      Reynold Xin authored
      Unit test is still in Scala.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6738 from rxin/utf8string-java and squashes the following commits:
      
      562dc6e [Reynold Xin] Flag...
      98e600b [Reynold Xin] Another try with encoding setting ..
      cfa6bdf [Reynold Xin] Merge branch 'master' into utf8string-java
      a3b124d [Reynold Xin] Try different UTF-8 encoded characters.
      1ff7c82 [Reynold Xin] Enable UTF-8 encoding.
      82d58cc [Reynold Xin] Reset run-tests.
      2cb3c69 [Reynold Xin] Use utf-8 encoding in set bytes.
      53f8ef4 [Reynold Xin] Hack Jenkins to run one test.
      9a48e8d [Reynold Xin] Fixed runtime compilation error.
      911c450 [Reynold Xin] Moved unit test also to Java.
      4eff7bd [Reynold Xin] Improved unit test coverage.
      8e89a3c [Reynold Xin] Fixed tests.
      77c64bd [Reynold Xin] Fixed string type codegen.
      ffedb62 [Reynold Xin] Code review feedback.
      0967ce6 [Reynold Xin] Fixed import ordering.
      45a123d [Reynold Xin] [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package.
      7d669a56
    • Adam Roberts's avatar
      [SPARK-8289] Specify stack size for consistency with Java tests - resolves test failures · 6b68366d
      Adam Roberts authored
      This change is a simple one and specifies a stack size of 4096k instead of the vendor default for Java tests (the defaults vary between Java vendors). This remedies test failures observed with JavaALSSuite with IBM and Oracle Java owing to a lower default size in comparison to the size with OpenJDK. 4096k is a suitable default where the tests pass with each Java vendor tested. The alternative is to reduce the number of iterations in the test (no observed failures with 5 iterations instead of 15).
      
      -Xss works with Oracle's HotSpot VM, IBM's J9 VM and OpenJDK (IcedTea).
      
      I have ensured this does not have any negative implications for other tests.
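
      On the SBT side, the change boils down to something like this hedged one-liner (Maven gets the equivalent in pom.xml, per the commits below):

      ```scala
      // Fork test JVMs with an explicit 4096k thread stack so deep recursion in tests
      // (e.g. JavaALSSuite) behaves the same across Java vendors.
      javaOptions in Test += "-Xss4096k"
      ```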
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      Author: a-roberts <aroberts@uk.ibm.com>
      
      Closes #6727 from a-roberts/IncJavaStackSize and squashes the following commits:
      
      ab40aea [Adam Roberts] Specify stack size for SBT builds
      5032d8d [a-roberts] Update pom.xml
      6b68366d
  15. Jun 08, 2015
    • Marcelo Vanzin's avatar
      [SPARK-8126] [BUILD] Use custom temp directory during build. · a1d9e5cc
      Marcelo Vanzin authored
      Even with all the efforts to cleanup the temp directories created by
      unit tests, Spark leaves a lot of garbage in /tmp after a test run.
      This change overrides java.io.tmpdir to place those files under the
      build directory instead.
      
      After an sbt full unit test run, I was left with > 400 MB of temp
      files. Since they're now under the build dir, it's much easier to
      clean them up.
      
      Also make a slight change to a unit test to make it not pollute the
      source directory with test data.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6674 from vanzin/SPARK-8126 and squashes the following commits:
      
      0f8ad41 [Marcelo Vanzin] Make sure tmp dir exists when tests run.
      643e916 [Marcelo Vanzin] [MINOR] [BUILD] Use custom temp directory during build.
      a1d9e5cc
  16. Jun 07, 2015
    • cody koeninger's avatar
      [SPARK-2808] [STREAMING] [KAFKA] cleanup tests from · b127ff8a
      cody koeninger authored
      see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #5921 from koeninger/kafka-0.8.2-test-cleanup and squashes the following commits:
      
      1e89dc8 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
      4662828 [cody koeninger] [Streaming][Kafka] filter mima issue for removal of method from private test class
      af1e083 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
      4298ac2 [cody koeninger] [Streaming][Kafka] update comment to trigger jenkins attempt
      1274afb [cody koeninger] [Streaming][Kafka] see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests
      b127ff8a
  17. Jun 05, 2015
    • Andrew Or's avatar
      Revert "[MINOR] [BUILD] Use custom temp directory during build." · 4036d05c
      Andrew Or authored
      This reverts commit b16b5434.
      4036d05c
    • Marcelo Vanzin's avatar
      [MINOR] [BUILD] Use custom temp directory during build. · b16b5434
      Marcelo Vanzin authored
      Even with all the efforts to cleanup the temp directories created by
      unit tests, Spark leaves a lot of garbage in /tmp after a test run.
      This change overrides java.io.tmpdir to place those files under the
      build directory instead.
      
      After an sbt full unit test run, I was left with > 400 MB of temp
      files. Since they're now under the build dir, it's much easier to
      clean them up.
      
      Also make a slight change to a unit test to make it not pollute the
      source directory with test data.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6653 from vanzin/unit-test-tmp and squashes the following commits:
      
      31e2dd5 [Marcelo Vanzin] Fix tests that depend on each other.
      aa92944 [Marcelo Vanzin] [minor] [build] Use custom temp directory during build.
      b16b5434
  18. Jun 04, 2015
    • Josh Rosen's avatar
      [SPARK-8106] [SQL] Set derby.system.durability=test to speed up Hive compatibility tests · 74dc2a90
      Josh Rosen authored
      Derby has a `derby.system.durability` configuration property that can be used to disable I/O synchronization calls for writes. This sacrifices durability but can result in large performance gains, which is appropriate for tests.
      
      We should enable this in our test system properties in order to speed up the Hive compatibility tests. I saw 2-3x speedups locally with this change.
      
      See https://db.apache.org/derby/docs/10.8/ref/rrefproperdurability.html for more documentation of this property.
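
      In build terms this is a single test-JVM system property, roughly as below (a hedged sketch; the property itself is documented at the Derby link above):

      ```scala
      // Trade durability for speed in Derby-backed Hive compatibility tests.
      javaOptions in Test += "-Dderby.system.durability=test"
      ```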
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6651 from JoshRosen/hive-compat-suite-speedup and squashes the following commits:
      
      b7a08a2 [Josh Rosen] Set derby.system.durability=test in our unit tests.
      74dc2a90
    • Reynold Xin's avatar
      [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate · 2bcdf8c2
      Reynold Xin authored
      This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become
      more efficient over time as we optimize Aggregate (via Tungsten).
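
      The rewrite itself is small; a hedged sketch of the kind of optimizer rule described (simplified, and the exact rule name in Catalyst may differ):

      ```scala
      import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Distinct, LogicalPlan}
      import org.apache.spark.sql.catalyst.rules.Rule

      // DISTINCT over a relation is the same as grouping by every output column.
      object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transform {
          case Distinct(child) => Aggregate(child.output, child.output, child)
        }
      }
      ```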
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6637 from rxin/replace-distinct and squashes the following commits:
      
      b3cc50e [Reynold Xin] Mima excludes.
      93d6117 [Reynold Xin] Code review feedback.
      87e4741 [Reynold Xin] [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate.
      2bcdf8c2
    • Davies Liu's avatar
      [SPARK-7956] [SQL] Use Janino to compile SQL expressions into bytecode · c8709dcf
      Davies Liu authored
      In order to reduce the overhead of codegen, this PR switches to Janino to compile SQL expressions into bytecode.
      
      After this, the time taken to compile a SQL expression decreases from 100ms to 5ms, which is necessary to turn on codegen for general workloads as well as tests.
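
      The core of the approach, in a hedged standalone sketch (Spark's actual CodeGenerator wraps Janino differently, but the compile step looks roughly like this):

      ```scala
      import org.codehaus.janino.SimpleCompiler

      // Compile a small generated Java class in-process; Janino skips javac,
      // which is where the compile-time win comes from.
      val compiler = new SimpleCompiler()
      compiler.cook(
        """public class GeneratedExpr {
          |  public static int addOne(int x) { return x + 1; }
          |}""".stripMargin)
      val clazz = compiler.getClassLoader.loadClass("GeneratedExpr")
      ```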
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6479 from davies/janino and squashes the following commits:
      
      cc689f5 [Davies Liu] remove globalLock
      262d848 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      eec3a33 [Davies Liu] address comments from Josh
      f37c8c3 [Davies Liu] fix DecimalType and cast to String
      202298b [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      a21e968 [Davies Liu] fix style
      0ed3dc6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      551a851 [Davies Liu] fix tests
      c3bdffa [Davies Liu] remove print
      6089ce5 [Davies Liu] change logging level
      7e46ac3 [Davies Liu] fix style
      d8f0f6c [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      da4926a [Davies Liu] fix tests
      03660f3 [Davies Liu] WIP: use Janino to compile Java source
      f2629cd [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      f7d66cf [Davies Liu] use template based string for codegen
      c8709dcf
  19. Jun 03, 2015
    • Patrick Wendell's avatar
      [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 · 2c4d550e
      Patrick Wendell authored
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:
      
      2f42d02 [Patrick Wendell] A few more excludes
      4bebcf0 [Patrick Wendell] Update to RC4
      61aaf46 [Patrick Wendell] Using new release candidate
      55f1610 [Patrick Wendell] Another exclude
      04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
      36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
      2c4d550e
  20. May 30, 2015
  21. May 29, 2015
    • Holden Karau's avatar
      [SPARK-7910] [TINY] [JAVAAPI] expose partitioner information in javardd · 82a396c2
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6464 from holdenk/SPARK-7910-expose-partitioner-information-in-javardd and squashes the following commits:
      
      de1e644 [Holden Karau] Fix the test to get the partitioner
      bdb31cc [Holden Karau] Add Mima exclude for the new method
      347ef4c [Holden Karau] Add a quick little test for the partitioner JavaAPI
      f49dca9 [Holden Karau] Add partitoner information to JavaRDDLike and fix some whitespace
      82a396c2
  22. May 24, 2015
  23. May 22, 2015
    • Michael Armbrust's avatar
      [SPARK-6743] [SQL] Fix empty projections of cached data · 3b68cb04
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6165 from marmbrus/wrongColumn and squashes the following commits:
      
      4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
      aad7eab [Michael Armbrust] rxins comments
      f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
      3b68cb04
  24. May 19, 2015
    • Xiangrui Meng's avatar
      [SPARK-7681] [MLLIB] remove mima excludes for 1.3 · 6845cb2f
      Xiangrui Meng authored
      These excludes are unnecessary for 1.3 because the changes were made in 1.4.x.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6254 from mengxr/SPARK-7681-mima and squashes the following commits:
      
      7f0cea0 [Xiangrui Meng] remove mima excludes for 1.3
      6845cb2f
  25. May 18, 2015
    • Liang-Chi Hsieh's avatar
      [SPARK-7681] [MLLIB] Add SparseVector support for gemv · d03638cc
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7681
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
      
      ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
      b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
      57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
      458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
      054f05d [Liang-Chi Hsieh] Fix scala style.
      410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
      4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
      5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
      c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
      d03638cc
    • Rene Treffer's avatar
      [SPARK-6888] [SQL] Make the jdbc driver handling user-definable · e1ac2a95
      Rene Treffer authored
      Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
      and allow developers to change the dialects on the fly (for new JDBCRDDs only).
      
      Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java;
      the status quo is that the RDD will fail to load.
      This patch makes it possible to override the type mapping, e.g. to read
      64-bit numbers as strings and handle them afterwards in software.
      
      JDBCSuite has an example that maps all types to String, which should always
      work (at the cost of extra code afterwards).
      
      As a side effect it should now be possible to develop simple dialects
      out-of-tree and even with spark-shell.
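
      For example, a user-defined dialect in the spirit of this change might look like the hedged sketch below (the `JdbcDialect`/`JdbcDialects` names come from the description; the signatures follow the public API this PR introduces and may differ in detail):

      ```scala
      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
      import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

      // Map a column type that Java has no native equivalent for to StringType,
      // so the rows load and can be post-processed in software.
      object UnsignedBigIntAsString extends JdbcDialect {
        override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
        override def getCatalystType(
            sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
          if (typeName == "BIGINT UNSIGNED") Some(StringType) else None
      }

      JdbcDialects.registerDialect(UnsignedBigIntAsString)
      ```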
      
      Author: Rene Treffer <treffer@measite.de>
      
      Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits:
      
      3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report
      fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
      e1ac2a95
  26. May 13, 2015
    • Josh Rosen's avatar
      [SPARK-7081] Faster sort-based shuffle path using binary processing cache-aware sort · 73bed408
      Josh Rosen authored
      This patch introduces a new shuffle manager that enhances the existing sort-based shuffle with a new cache-friendly sort algorithm that operates directly on binary data. The goals of this patch are to lower memory usage and Java object overheads during shuffle and to speed up sorting. It also lays groundwork for follow-up patches that will enable end-to-end processing of serialized records.
      
      The new shuffle manager, `UnsafeShuffleManager`, can be enabled by setting `spark.shuffle.manager=tungsten-sort` in SparkConf.
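
      Enabling it is a one-line configuration change, e.g. (a minimal sketch using the setting named above):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setAppName("tungsten-sort-demo")            // hypothetical application name
        .set("spark.shuffle.manager", "tungsten-sort")
      ```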
      
      The new shuffle manager uses directly-managed memory to implement several performance optimizations for certain types of shuffles. In cases where the new performance optimizations cannot be applied, the new shuffle manager delegates to SortShuffleManager to handle those shuffles.
      
      UnsafeShuffleManager's optimizations will apply when _all_ of the following conditions hold:
      
       - The shuffle dependency specifies no aggregation or output ordering.
       - The shuffle serializer supports relocation of serialized values (this is currently supported
         by KryoSerializer and Spark SQL's custom serializers).
       - The shuffle produces fewer than 16777216 output partitions.
       - No individual record is larger than 128 MB when serialized.
      
       In addition, extra spill-merging optimizations are automatically applied when the shuffle compression codec supports concatenation of serialized streams. This is currently supported by Spark's LZF compression codec.
      
      At a high-level, UnsafeShuffleManager's design is similar to Spark's existing SortShuffleManager.  In sort-based shuffle, incoming records are sorted according to their target partition ids, then written to a single map output file. Reducers fetch contiguous regions of this file in order to read their portion of the map output. In cases where the map output data is too large to fit in memory, sorted subsets of the output are spilled to disk and those on-disk files are merged to produce the final output file.
      
      UnsafeShuffleManager optimizes this process in several ways:
      
       - Its sort operates on serialized binary data rather than Java objects, which reduces memory consumption and GC overheads. This optimization requires the record serializer to have certain properties to allow serialized records to be re-ordered without requiring deserialization.  See SPARK-4550, where this optimization was first proposed and implemented, for more details.
      
       - It uses a specialized cache-efficient sorter (UnsafeShuffleExternalSorter) that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, this fits more of the array into cache.
      
       - The spill merging procedure operates on blocks of serialized records that belong to the same partition and does not need to deserialize records during the merge.
      
       - When the spill compression codec supports concatenation of compressed data, the spill merge simply concatenates the serialized and compressed spill partitions to produce the final output partition.  This allows efficient data copying methods, like NIO's `transferTo`, to be used and avoids the need to allocate decompression or copying buffers during the merge.
      
      The shuffle read path is unchanged.
      
      This patch is similar to [SPARK-4550](http://issues.apache.org/jira/browse/SPARK-4550) / #4450 but uses a slightly different implementation. The `unsafe`-based implementation featured in this patch lays the groundwork for followup patches that will enable sorting to operate on serialized data pages that will be prepared by Spark SQL's new `unsafe` operators (such as the new aggregation operator introduced in #5725).
      
      ### Future work
      
      There are several tasks that build upon this patch, which will be left to future work:
      
      - [SPARK-7271](https://issues.apache.org/jira/browse/SPARK-7271) Redesign / extend the shuffle interfaces to accept binary data as input. The goal here is to let us bypass serialization steps in cases where the sort input is produced by an operator that operates directly on binary data.
      - Extension / redesign of the `Serializer` API. We can add new methods which allow serializers to determine the size requirements for serializing objects and for serializing objects directly to a specified memory address (similar to how `UnsafeRowConverter` works in Spark SQL).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5868 from JoshRosen/unsafe-sort and squashes the following commits:
      
      ef0a86e [Josh Rosen] Fix scalastyle errors
      7610f2f [Josh Rosen] Add tests for proper cleanup of shuffle data.
      d494ffe [Josh Rosen] Fix deserialization of JavaSerializer instances.
      52a9981 [Josh Rosen] Fix some bugs in the address packing code.
      51812a7 [Josh Rosen] Change shuffle manager sort name to tungsten-sort
      4023fa4 [Josh Rosen] Add @Private annotation to some Java classes.
      de40b9d [Josh Rosen] More comments to try to explain metrics code
      df07699 [Josh Rosen] Attempt to clarify confusing metrics update code
      5e189c6 [Josh Rosen] Track time spend closing / flushing files; split TimeTrackingOutputStream into separate file.
      d5779c6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      c2ce78e [Josh Rosen] Fix a missed usage of MAX_PARTITION_ID
      e3b8855 [Josh Rosen] Cleanup in UnsafeShuffleWriter
      4a2c785 [Josh Rosen] rename 'sort buffer' to 'pointer array'
      6276168 [Josh Rosen] Remove ability to disable spilling in UnsafeShuffleExternalSorter.
      57312c9 [Josh Rosen] Clarify fileBufferSize units
      2d4e4f4 [Josh Rosen] Address some minor comments in UnsafeShuffleExternalSorter.
      fdcac08 [Josh Rosen] Guard against overflow when expanding sort buffer.
      85da63f [Josh Rosen] Cleanup in UnsafeShuffleSorterIterator.
      0ad34da [Josh Rosen] Fix off-by-one in nextInt() call
      56781a1 [Josh Rosen] Rename UnsafeShuffleSorter to UnsafeShuffleInMemorySorter
      e995d1a [Josh Rosen] Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
      e58a6b4 [Josh Rosen] Add more tests for PackedRecordPointer encoding.
      4f0b770 [Josh Rosen] Attempt to implement proper shuffle write metrics.
      d4e6d89 [Josh Rosen] Update to bit shifting constants
      69d5899 [Josh Rosen] Remove some unnecessary override vals
      8531286 [Josh Rosen] Add tests that automatically trigger spills.
      7c953f9 [Josh Rosen] Add test that covers UnsafeShuffleSortDataFormat.swap().
      e1855e5 [Josh Rosen] Fix a handful of misc. IntelliJ inspections
      39434f9 [Josh Rosen] Avoid integer multiplication overflow in getMemoryUsage (thanks FindBugs!)
      1e3ad52 [Josh Rosen] Delete unused ByteBufferOutputStream class.
      ea4f85f [Josh Rosen] Roll back an unnecessary change in Spillable.
      ae538dc [Josh Rosen] Document UnsafeShuffleManager.
      ec6d626 [Josh Rosen] Add notes on maximum # of supported shuffle partitions.
      0d4d199 [Josh Rosen] Bump up shuffle.memoryFraction to make tests pass.
      b3b1924 [Josh Rosen] Properly implement close() and flush() in DummySerializerInstance.
      1ef56c7 [Josh Rosen] Revise compression codec support in merger; test cross product of configurations.
      b57c17f [Josh Rosen] Disable some overly-verbose logs that rendered DEBUG useless.
      f780fb1 [Josh Rosen] Add test demonstrating which compression codecs support concatenation.
      4a01c45 [Josh Rosen] Remove unnecessary log message
      27b18b0 [Josh Rosen] That for inserting records AT the max record size.
      fcd9a3c [Josh Rosen] Add notes + tests for maximum record / page sizes.
      9d1ee7c [Josh Rosen] Fix MiMa excludes for ShuffleWriter change
      fd4bb9e [Josh Rosen] Use own ByteBufferOutputStream rather than Kryo's
      67d25ba [Josh Rosen] Update Exchange operator's copying logic to account for new shuffle manager
      8f5061a [Josh Rosen] Strengthen assertion to check partitioning
      01afc74 [Josh Rosen] Actually read data in UnsafeShuffleWriterSuite
      1929a74 [Josh Rosen] Update to reflect upstream ShuffleBlockManager -> ShuffleBlockResolver rename.
      e8718dd [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      9b7ebed [Josh Rosen] More defensive programming RE: cleaning up spill files and memory after errors
      7cd013b [Josh Rosen] Begin refactoring to enable proper tests for spilling.
      722849b [Josh Rosen] Add workaround for transferTo() bug in merging code; refactor tests.
      9883e30 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      b95e642 [Josh Rosen] Refactor and document logic that decides when to spill.
      1ce1300 [Josh Rosen] More minor cleanup
      5e8cf75 [Josh Rosen] More minor cleanup
      e67f1ea [Josh Rosen] Remove upper type bound in ShuffleWriter interface.
      cfe0ec4 [Josh Rosen] Address a number of minor review comments:
      8a6fe52 [Josh Rosen] Rename UnsafeShuffleSpillWriter to UnsafeShuffleExternalSorter
      11feeb6 [Josh Rosen] Update TODOs related to shuffle write metrics.
      b674412 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      aaea17b [Josh Rosen] Add comments to UnsafeShuffleSpillWriter.
      4f70141 [Josh Rosen] Fix merging; now passes UnsafeShuffleSuite tests.
      133c8c9 [Josh Rosen] WIP towards testing UnsafeShuffleWriter.
      f480fb2 [Josh Rosen] WIP in mega-refactoring towards shuffle-specific sort.
      57f1ec0 [Josh Rosen] WIP towards packed record pointers for use in optimized shuffle sort.
      69232fd [Josh Rosen] Enable compressible address encoding for off-heap mode.
      7ee918e [Josh Rosen] Re-order imports in tests
      3aeaff7 [Josh Rosen] More refactoring and cleanup; begin cleaning iterator interfaces
      3490512 [Josh Rosen] Misc. cleanup
      f156a8f [Josh Rosen] Hacky metrics integration; refactor some interfaces.
      2776aca [Josh Rosen] First passing test for ExternalSorter.
      5e100b2 [Josh Rosen] Super-messy WIP on external sort
      595923a [Josh Rosen] Remove some unused variables.
      8958584 [Josh Rosen] Fix bug in calculating free space in current page.
      f17fa8f [Josh Rosen] Add missing newline
      c2fca17 [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
      b8a09fe [Josh Rosen] Back out accidental log4j.properties change
      bfc12d3 [Josh Rosen] Add tests for serializer relocation property.
      240864c [Josh Rosen] Remove PrefixComputer and require prefix to be specified as part of insert()
      1433b42 [Josh Rosen] Store record length as int instead of long.
      026b497 [Josh Rosen] Re-use a buffer in UnsafeShuffleWriter
      0748458 [Josh Rosen] Port UnsafeShuffleWriter to Java.
      87e721b [Josh Rosen] Renaming and comments
      d3cc310 [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
      e2d96ca [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
      e267cee [Josh Rosen] Fix compilation of UnsafeSorterSuite
      9c6cf58 [Josh Rosen] Refactor to use DiskBlockObjectWriter.
      253f13e [Josh Rosen] More cleanup
      8e3ec20 [Josh Rosen] Begin code cleanup.
      4d2f5e1 [Josh Rosen] WIP
      3db12de [Josh Rosen] Minor simplification and sanity checks in UnsafeSorter
      767d3ca [Josh Rosen] Fix invalid range in UnsafeSorter.
      e900152 [Josh Rosen] Add test for empty iterator in UnsafeSorter
      57a4ea0 [Josh Rosen] Make initialSize configurable in UnsafeSorter
      abf7bfe [Josh Rosen] Add basic test case.
      81d52c5 [Josh Rosen] WIP on UnsafeSorter
      73bed408
    • Reynold Xin's avatar
      [SQL] Move some classes into packages that are more appropriate. · e683182c
      Reynold Xin authored
      JavaTypeInference into catalyst
      types.DateUtils into catalyst
      CacheManager into execution
      DefaultParserDialect into catalyst
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6108 from rxin/sql-rename and squashes the following commits:
      
      3fc9613 [Reynold Xin] Fixed import ordering.
      83d9ff4 [Reynold Xin] Fixed codegen tests.
      e271e86 [Reynold Xin] mima
      f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
      e683182c
    • Cheng Lian's avatar
      [SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelation · 7ff16e8a
      Cheng Lian authored
      This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:
      
      1. Partition discovery code has been factored out to `FSBasedRelation`
      1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
      1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
      1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore
      
         After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6090 from liancheng/parquet-migration and squashes the following commits:
      
      6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
      bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
      f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
      261d8c1 [Cheng Lian] Minor bug fix and more tests
      db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
      7ff16e8a
  27. May 12, 2015
    • Cheng Lian's avatar
      [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API · 0595b6de
      Cheng Lian authored
      This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path.  Existing data sources like JSON and Parquet can be simplified with this work.
      
      ## New features provided
      
      1. Hive compatible partition discovery
      
         This actually generalizes the partition discovery strategy used in Parquet data source in Spark 1.3.0.
      
      1. Generalized partition pruning optimization
      
         Now partition pruning is handled during physical planning phase.  Specific data sources don't need to worry about this harness anymore.
      
         (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)
      
      1. Insertion with dynamic partitions
      
         When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns.
      
      ## New structures provided
      
      ### Developer API
      
      1. `FSBasedRelation`
      
         Base abstract class for file system based data sources.
      
      1. `OutputWriter`
      
         Base abstract class for output row writers, responsible for writing a single row object (see the sketch after this list).
      
      1. `FSBasedRelationProvider`
      
         A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.
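
      As a rough illustration of the developer-facing shape (the names come from the list above; the member signatures are assumptions, not the exact API):

      ```scala
      import org.apache.spark.sql.Row

      // Hedged shape of the new developer API: an FSBasedRelation asks its provider
      // for OutputWriters, and each writer persists one row at a time for the
      // partition directory it was opened for.
      abstract class OutputWriter {
        def write(row: Row): Unit   // called once per row routed to this writer
        def close(): Unit           // flush and commit the underlying file
      }
      ```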
      
      ### User API
      
      New overloaded versions of
      
      1. `DataFrame.save()`
      1. `DataFrame.saveAsTable()`
      1. `SQLContext.load()`
      
      are provided to allow users to save/load DataFrames with user defined dynamic partition columns.
      
      ### Spark SQL query planning
      
      1. `InsertIntoFSBasedRelation`
      
         Used to implement write path for `FSBasedRelation`s.
      
      1. New rules for `FSBasedRelation` in `DataSourceStrategy`
      
         These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning.
      
      ## TODO
      
      - [ ] Use scratch directories when overwriting a table with data selected from itself.
      
            Currently, this is not supported, because the table being overwritten is always deleted before writing any data to it.
      
      - [ ] When inserting with dynamic partition columns, use external sorter to group the data first.
      
            This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory consuming.  One issue is that this approach breaks the row distribution in the original DataFrame.  However, we didn't promise to preserve data distribution when writing a DataFrame.
      
      - [x] More tests.  Specifically, test cases for
      
            - [x] Self-join
            - [x] Loading partitioned relations with a subset of partition columns stored in data files.
            - [x] `SQLContext.load()` with user defined dynamic partition columns.
      
      ## Parquet data source migration
      
      Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR needs to be made after this one is merged.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5526 from liancheng/partitioning-support and squashes the following commits:
      
      5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
      1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
      43ba50e [Cheng Lian] Avoids serializing generated projection code
      edf49e7 [Cheng Lian] Removed commented stale code block
      348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
      ad4d4de [Cheng Lian] Enables HDFS style globbing
      8d12e69 [Cheng Lian] Fixes compilation error
      c71ac6c [Cheng Lian] Addresses comments from @marmbrus
      7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
      0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
      52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
      c466de6 [Cheng Lian] Addresses comments
      bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
      795920a [Cheng Lian] Fixes compilation error after rebasing
      0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
      fa543f3 [Cheng Lian] Addresses comments
      5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
      51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
      c4ed4fe [Cheng Lian] Bug fixes and a new test suite
      a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan
      5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
      54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
      be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
      0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
      f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
      422ff4a [Cheng Lian] Fixes style issue
      ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
      8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
      ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
      f18dec2 [Cheng Lian] More strict schema checking
      b746ab5 [Cheng Lian] More tests
      9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
      ea6c8dd [Cheng Lian] Removes remote debugging stuff
      327bb1d [Cheng Lian] Implements partitioning support for data sources API
      3c5073a [Cheng Lian] Fixes SaveModes used in test cases
      fb5a607 [Cheng Lian] Fixes compilation error
      9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
      5de194a [Cheng Lian] Forgot Apache licence header
      95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
      770b5ba [Cheng Lian] Adds tests for FSBasedRelation
      3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
      1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
      aa8ba9a [Cheng Lian] Javadoc fix
      012ed2d [Cheng Lian] Adds PartitioningOptions
      7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support
      0595b6de
    • Marcelo Vanzin's avatar
      [SPARK-7485] [BUILD] Remove pyspark files from assembly. · 82e890fb
      Marcelo Vanzin authored
      The sbt part of the build is hacky; it basically tricks sbt
      into generating the zip by using a generator, but returns
      an empty list for the generated files so that nothing is
      actually added to the assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:
      
      22c1e04 [Marcelo Vanzin] Remove unneeded code.
      4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
      82e890fb