  1. Mar 23, 2016
    • [SPARK-13952][ML] Add random seed to GBT · 69bc2c17
      sethah authored
      ## What changes were proposed in this pull request?
      
      `GBTClassifier` and `GBTRegressor` should use random seed for reproducible results. Because of the nature of current unit tests, which compare GBTs in ML and GBTs in MLlib for equality, I also added a random seed to MLlib GBT algorithm. I made alternate constructors in `mllib.tree.GradientBoostedTrees` to accept a random seed, but left them as private so as to not change the API unnecessarily.
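      A minimal usage sketch of the new parameter (`training` is a hypothetical DataFrame with "label" and "features" columns, not something from the patch itself):

      ```scala
      import org.apache.spark.ml.classification.GBTClassifier

      // `training` is assumed to be an existing DataFrame with "label" and "features" columns.
      val gbt = new GBTClassifier()
        .setMaxIter(10)
        .setSeed(42L)  // a fixed seed yields the same ensemble on every run

      val modelA = gbt.fit(training)
      val modelB = gbt.fit(training)
      // With the same seed and data, the two models should make identical predictions.
      ```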
      
      ## How was this patch tested?
      
      Existing unit tests verify that functionality did not change. Other ML algorithms do not seem to have unit tests that directly test the functionality of random seeding, but reproducibility with seeding for GBTs is effectively verified in existing tests. I can add more tests if needed.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11903 from sethah/SPARK-13952.
      69bc2c17
    • [SPARK-14014][SQL] Replace existing catalog with SessionCatalog · 5dfc0197
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
      
      As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
      - SPARK-14013: Properly implement temporary functions in `SessionCatalog`
      - SPARK-13879: Decide which DDL/DML commands to support natively in Spark
      - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
      - SPARK-?????: Merge SQL/HiveContext
      
      ## How was this patch tested?
      
      This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #11836 from andrewor14/use-session-catalog.
      5dfc0197
    • [SPARK-14078] Streaming Parquet Based FileSink · 6bc4be64
      Michael Armbrust authored
      This PR adds a new `Sink` implementation that writes out Parquet files.  In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log.  When a parquet based `DataSource` is initialized for reading, we first check for this log directory and use it instead of file listing when present.
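      A minimal sketch of the write-then-log idea using plain file I/O (illustrative only; the names and log format here are assumptions, not the actual classes in this patch):

      ```scala
      import java.nio.file.{Files, Paths, StandardOpenOption}
      import java.util.Collections

      // Write a batch into its own directory, then record that directory in a metadata log.
      // Readers trust only directories listed in the log, so a partially written batch that
      // never reaches the log is simply ignored and can be retried safely.
      def commitBatch(root: String, batchId: Long, writeFiles: String => Unit): Unit = {
        val batchDir = s"$root/batch-$batchId"
        writeFiles(batchDir)  // write the batch's Parquet files under batchDir
        Files.write(
          Paths.get(s"$root/_metadata_log"),
          Collections.singletonList(batchDir),
          StandardOpenOption.CREATE, StandardOpenOption.APPEND)
      }
      ```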
      
      Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #11897 from marmbrus/fileSink.
      6bc4be64
    • [SPARK-13325][SQL] Create a 64-bit hashcode expression · 919bf321
      Herman van Hovell authored
      This PR introduces a 64-bit hashcode expression. Such an expression is especially useful for HyperLogLog++ and other probabilistic data structures.

      I have implemented xxHash64, a 64-bit hashing algorithm created by Yann Collet and Mathias Westerdahl. It is high speed (the C implementation runs at memory bandwidth) and high quality. Like MurmurHash, it exploits both instruction-level parallelism (for speed) and multiplication and rotation techniques (for quality).
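      For illustration, here is the final "avalanche" step of xxHash64, using the constants from the public xxHash specification (a simplified sketch, not the `XxHash64` expression added in this patch):

      ```scala
      object XxHash64Sketch {
        private val Prime64_2 = 0xC2B2AE3D27D4EB4FL
        private val Prime64_3 = 0x165667B19E3779F9L

        // Final avalanche: xor-shifts and multiplications spread every input bit
        // across all 64 bits of the result.
        def avalanche(hash: Long): Long = {
          var h = hash
          h ^= h >>> 33
          h *= Prime64_2
          h ^= h >>> 29
          h *= Prime64_3
          h ^= h >>> 32
          h
        }
      }
      ```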
      
      The initial results are promising. I have added a CG'ed test to the `HashBenchmark`, and this results in the following results (running from SBT):
      
          Running benchmark: Hash For simple
            Running case: interpreted version
            Running case: codegen version
            Running case: codegen version 64-bit
      
          Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
          Hash For simple:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          interpreted version                      1011 / 1016        132.8           7.5       1.0X
          codegen version                          1864 / 1869         72.0          13.9       0.5X
          codegen version 64-bit                   1614 / 1644         83.2          12.0       0.6X
      
          Running benchmark: Hash For normal
            Running case: interpreted version
            Running case: codegen version
            Running case: codegen version 64-bit
      
          Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
          Hash For normal:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          interpreted version                      2467 / 2475          0.9        1176.1       1.0X
          codegen version                          2008 / 2115          1.0         957.5       1.2X
          codegen version 64-bit                    728 /  758          2.9         347.0       3.4X
      
          Running benchmark: Hash For array
            Running case: interpreted version
            Running case: codegen version
            Running case: codegen version 64-bit
      
          Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
          Hash For array:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          interpreted version                      1544 / 1707          0.1       11779.6       1.0X
          codegen version                          2728 / 2745          0.0       20815.5       0.6X
          codegen version 64-bit                   2508 / 2549          0.1       19132.8       0.6X
      
          Running benchmark: Hash For map
            Running case: interpreted version
            Running case: codegen version
            Running case: codegen version 64-bit
      
          Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
          Hash For map:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          interpreted version                      1819 / 1826          0.0      444014.3       1.0X
          codegen version                           183 /  194          0.0       44642.9       9.9X
          codegen version 64-bit                    173 /  174          0.0       42120.9      10.5X
      
      This shows that the algorithm is consistently faster than MurmurHash32 in all cases, and up to 3x (!) faster in the normal case.
      
      I have also added this to HyperLogLog++ and it cuts the processing time of the following code in half:
      
          val df = sqlContext.range(1<<25).agg(approxCountDistinct("id"))
          df.explain()
          val t = System.nanoTime()
          df.show()
          val ns = System.nanoTime() - t
      
          // Before
          ns: Long = 5821524302
      
          // After
          ns: Long = 2836418963
      
      cc cloud-fan (you have been working on hashcodes) / rxin
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #11209 from hvanhovell/xxHash.
      919bf321
    • [SPARK-13809][SQL] State store for streaming aggregations · 8c826880
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      In this PR, I am implementing a new abstraction for managing streaming state data - the State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming DataFrames. The motivation and design are discussed here.
      
      https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#
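      A minimal sketch of the kind of versioned key-value interface described in the design (names and signatures are illustrative, not the actual `StateStore` API in this patch):

      ```scala
      trait RunningAggregateStore {
        // Read the current running aggregate for a key, if any.
        def get(key: String): Option[Long]
        // Update (or create) the running aggregate for a key in the current version.
        def put(key: String, value: Long): Unit
        // Atomically commit all updates made since the last commit; returns the new version.
        def commit(): Long
        // Discard uncommitted updates (e.g. on task failure).
        def abort(): Unit
      }
      ```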
      
      ## How was this patch tested?
      - [x] Unit tests
      - [x] Cluster tests
      
      **Coverage from unit tests**
      
      <img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png">
      
      ## TODO
      - [x] Fix updates() iterator to avoid duplicate updates for same key
      - [x] Use Coordinator in ContinuousQueryManager
      - [x] Plugging in hadoop conf and other confs
      - [x] Unit tests
        - [x] StateStore object lifecycle and methods
        - [x] StateStoreCoordinator communication and logic
        - [x] StateStoreRDD fault-tolerance
        - [x] StateStoreRDD preferred location using StateStoreCoordinator
      - [ ] Cluster tests
        - [ ] Whether preferred locations are set correctly
        - [ ] Whether recovery works correctly with distributed storage
        - [x] Basic performance tests
      - [x] Docs
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #11645 from tdas/state-store.
      8c826880
    • [SPARK-14015][SQL] Support TimestampType in vectorized parquet reader · 0a64294f
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for TimestampType in the vectorized parquet reader
      
      ## How was this patch tested?
      
      1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
      2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well.
      3.  Added tests in `HadoopFsRelationTest` that test both the dictionary encoded and non-encoded versions across all supported datatypes.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11882 from sameeragarwal/timestamp-parquet.
      0a64294f
    • [SPARK-14092] [SQL] move shouldStop() to end of while loop · 02d9c352
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR rolls back some changes in #11274, which introduced a performance regression when doing a simple aggregation on a Parquet scan with one integer column.

      I do not really understand how this change introduces such a huge impact; it may be related to how the JIT compiler inlines functions (profiling shows very different stats).
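      For reference, a simplified sketch of the loop shape this rolls back to (illustrative only; the real code is produced by whole-stage code generation):

      ```scala
      // Before: the stop flag was consulted at the top of every iteration,
      // roughly `while (!shouldStop() && iter.hasNext) { ... }`.
      // After: the check moves to the end of the loop body, keeping the hot path
      // a plain hasNext/next loop.
      def consume(iter: Iterator[Int], shouldStop: () => Boolean): Unit = {
        while (iter.hasNext) {
          val value = iter.next()
          // ... process value ...
          if (shouldStop()) return
        }
      }
      ```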
      
      ## How was this patch tested?
      
      Manually ran the Parquet reader benchmark. Before this change:
      ```
      Intel(R) Core(TM) i7-4558U CPU  2.80GHz
      Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      -------------------------------------------------------------------------------------------
      SQL Parquet Vectorized                   2391 / 3107         43.9          22.8       1.0X
      ```
      After this change
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
      Intel(R) Core(TM) i7-4558U CPU  2.80GHz
      Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      -------------------------------------------------------------------------------------------
      SQL Parquet Vectorized                   2032 / 2626         51.6          19.4       1.0X
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11912 from davies/fix_regression.
      02d9c352
    • [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params · 30bdb5cb
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.
      
      This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
      
      ## How was this patch tested?
      
      Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11663 from sethah/SPARK-13068-tc.
      30bdb5cb
    • [SPARK-14055] writeLocksByTask need to be update when removeBlock · 48ee16d8
      Ernest authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14055
      
      ## How was this patch tested?
      
      Manual tests by running LiveJournalPageRank on a large dataset (the dataset must be large enough to incur RDD partition eviction).
      
      Author: Ernest <earneyzxl@gmail.com>
      
      Closes #11875 from Earne/issue-14055.
      48ee16d8
    • [SPARK-14075] Refactor MemoryStore to be testable independent of BlockManager · 3de24ae2
      Josh Rosen authored
      This patch refactors the `MemoryStore` so that it can be tested without needing to construct / mock an entire `BlockManager`.
      
      - The block manager's serialization- and compression-related methods have been moved from `BlockManager` to `SerializerManager`.
      - `BlockInfoManager` is now passed directly to classes that need it, rather than being passed via the `BlockManager`.
      - The `MemoryStore` now calls `dropFromMemory` via a new `BlockEvictionHandler` interface rather than directly calling the `BlockManager` (a sketch of this interface follows the list). This change helps to enforce a narrow interface between the `MemoryStore` and `BlockManager` functionality and makes this interface easier to mock in tests.
      - Several of the block unrolling tests have been moved from `BlockManagerSuite` into a new `MemoryStoreSuite`.
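      A rough sketch of the narrow eviction interface mentioned above (types are simplified stand-ins; the actual `BlockEvictionHandler` in the patch uses `BlockId`, class tags, and richer data types):

      ```scala
      object EvictionSketch {
        type BlockId = String
        type BlockData = Either[Array[Any], Array[Byte]]  // deserialized values or serialized bytes

        trait BlockEvictionHandler {
          // Called by the MemoryStore when a block must be evicted; the handler decides
          // whether to spill the block to disk or drop it entirely.
          def dropFromMemory(blockId: BlockId, data: () => BlockData): Unit
        }
      }
      ```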
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11899 from JoshRosen/reduce-memorystore-blockmanager-coupling.
      3de24ae2
    • [SPARK-13549][SQL] Refactor the Optimizer Rule CollapseProject · 6ce008ba
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      The PR https://github.com/apache/spark/pull/10541 changed the rule `CollapseProject` by enabling collapsing `Project` into `Aggregate`. It leaves a to-do item to remove the duplicate code. This PR is to finish this to-do item. Also added a test case for covering this change.
      
      #### How was this patch tested?
      
      Added a new test case.
      
      liancheng Could you check if the code refactoring is fine? Thanks!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11427 from gatorsmile/collapseProjectRefactor.
      6ce008ba
    • [SPARK-13817][SQL][MINOR] Renames Dataset.newDataFrame to Dataset.ofRows · cde086cb
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR does the renaming as suggested by marmbrus in [this comment][1].
      
      ## How was this patch tested?
      
      Existing tests.
      
      [1]: https://github.com/apache/spark/commit/6d37e1eb90054cdb6323b75fb202f78ece604b15#commitcomment-16654694
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11889 from liancheng/spark-13817-follow-up.
      cde086cb
    • [SPARK-14074][SPARKR] Specify commit sha1 ID when using install_github to install intr package. · 7d117501
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      In dev/lint-r.R, `install_github` makes our builds depend on an unstable source. This may cause unexpected test failures and break the build. This PR adds a specific commit sha1 ID to `install_github` to get a stable source.
      
      ## How was this patch tested?
      dev/lint-r
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #11913 from sun-rui/SPARK-14074.
      7d117501
    • [SPARK-14035][MLLIB] Make error message more verbose for mllib NaiveBayesSuite · 4d955cd6
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Print more info about failed NaiveBayesSuite tests which have exhibited flakiness.
      
      ## How was this patch tested?
      
      Ran locally with an incorrect check to cause a failure.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11858 from jkbradley/naive-bayes-bug-log.
      4d955cd6
    • [HOTFIX][SQL] Don't stop ContinuousQuery in quietly · abacf5f2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Try to fix a flaky hang
      
      ## How was this patch tested?
      
      Existing Jenkins test
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11909 from zsxwing/hotfix2.
      abacf5f2
    • [SPARK-14088][SQL] Some Dataset API touch-up · 926a93e5
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL.
      2. Renamed reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with the rest of the functions in KeyValueGroupedDataset. This also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups.
      3. Added a "name" function, which is more natural for naming columns than "as" for non-SQL users.
      4. Removed the "subtract" function since it is just an alias for "except". (A short usage sketch of these changes follows.)
      
      ## How was this patch tested?
      All changes should be covered by existing tests. Also added a couple of test cases to cover "name".
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11908 from rxin/SPARK-14088.
      926a93e5
    • [MINOR][SQL][DOCS] Update `sql/README.md` and remove some unused imports in `sql` module. · 1a22cf1e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR updates `sql/README.md` according to the latest console output and removes some unused imports in the `sql` module. This was done manually, so there is no guarantee that all unused imports were removed.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11907 from dongjoon-hyun/update_sql_module.
      1a22cf1e
  2. Mar 22, 2016
    • [SPARK-13401][SQL][TESTS] Fix SQL test warnings. · 75dc2962
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
      This fixes several SQL test warnings under the sql/core/src/test directory. The fixed warnings include "[unchecked]", "[rawtypes]", and "[varargs]".
      
      ## How was this patch tested?
      
      All existing tests passed.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #11857 from yongtang/SPARK-13401.
      75dc2962
    • [SPARK-14072][CORE] Show JVM/OS version information when we run a benchmark program · 0d51b604
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR allows us to identify which JVM was used when someone ran a benchmark program. In some cases, the JVM version may affect performance results, so it is good to show processor information and JVM version information.
      
      ```
      model name	: Intel(R) Xeon(R) CPU E5-2697 v2  2.70GHz
      JVM information : OpenJDK 64-Bit Server VM, 1.7.0_65-mockbuild_2014_07_14_06_19-b00
      Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      -------------------------------------------------------------------------------------------
      SQL Parquet Vectorized                    981 /  994         10.7          93.5       1.0X
      SQL Parquet MR                           2518 / 2542          4.2         240.1       0.4X
      ```
      
      ```
      model name	: Intel(R) Xeon(R) CPU E5-2697 v2  2.70GHz
      JVM information : IBM J9 VM, pxa6480sr2-20151023_01 (SR2)
      String Dictionary:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      -------------------------------------------------------------------------------------------
      SQL Parquet Vectorized                    693 /  740         15.1          66.1       1.0X
      SQL Parquet MR                           2501 / 2562          4.2         238.5       0.3X
      ```
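      A sketch of how such an environment line can be produced from standard JVM system properties (illustrative; the actual change to the benchmark utility may format it differently):

      ```scala
      // Standard system properties carry the JVM and OS details shown above.
      val jvmInfo = s"${System.getProperty("java.vm.name")}, ${System.getProperty("java.runtime.version")}"
      val osInfo  = s"${System.getProperty("os.name")} ${System.getProperty("os.version")}"
      println(s"JVM information : $jvmInfo")
      println(s"OS information  : $osInfo")
      ```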
      
      ## How was this patch tested?
      
      Tested by using existing benchmark programs
      
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #11893 from kiszk/SPARK-14072.
      0d51b604
    • [SPARK-13806] [SQL] fix rounding mode of negative float/double · 4700adb9
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Round() in databases usually rounds half values away from zero, which is different from Math.round() in Java.
      
      For example:
      ```
      scala> java.lang.Math.round(-3.5)
      res3: Long = -3
      ```
      In a database, we should return -4.0 in this case.

      This PR removes the buggy special case for scale=0.
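      A small comparison of the two behaviors using plain JDK calls (illustrative, not Spark's implementation):

      ```scala
      import java.math.{BigDecimal => JBigDecimal}
      import java.math.RoundingMode

      // Math.round rounds half values toward positive infinity: -3.5 becomes -3.
      println(java.lang.Math.round(-3.5))                                 // -3
      // HALF_UP rounds half values away from zero, as databases do: -3.5 becomes -4.
      println(new JBigDecimal("-3.5").setScale(0, RoundingMode.HALF_UP))  // -4
      ```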
      
      ## How was this patch tested?
      
      Added tests for negative values with ties.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11894 from davies/fix_round.
      4700adb9
    • [HOTFIX][SQL] Add a timeout for 'cq.stop' · d16710b4
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fix an issue where DataFrameReaderWriterSuite may hang forever.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11902 from zsxwing/hotfix.
      d16710b4
    • [SPARK-13449] Naive Bayes wrapper in SparkR · d6dc12ef
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.
      
      I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.
      
      I removed the preprocessing part that omits NA values because we don't know which columns to process.
      
      ## How was this patch tested?
      
      Test against output from R package e1071's naiveBayes.
      
      cc: yanboliang yinxusen
      
      Closes #11486
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #11890 from mengxr/SPARK-13449.
      d6dc12ef
    • [SPARK-14060][SQL] Move StringToColumn implicit class into SQLImplicits · b2b1ad7d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves the StringToColumn implicit class into SQLImplicits. It had been kept in the SQLContext.implicits object for binary backward compatibility in the Spark 1.x series. It makes more sense for this API to be in SQLImplicits since that is the single class that defines all the SQL implicits.
      
      ## How was this patch tested?
      Should be covered by existing unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11878 from rxin/SPARK-14060.
      b2b1ad7d
    • [SPARK-13951][ML][PYTHON] Nested Pipeline persistence · 7e3423b9
      Joseph K. Bradley authored
      Adds support for saving and loading nested ML Pipelines from Python.  Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.
      
      Also:
      * Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
      * Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
      
      Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11866 from jkbradley/nested-pipeline-io.
      Closes #11835
      7e3423b9
    • [SPARK-14063][SQL] SQLContext.range should return Dataset[java.lang.Long] · 297c2022
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changed the return type for SQLContext.range from `Dataset[Long]` (Scala primitive) to `Dataset[java.lang.Long]` (Java boxed long).
      
      Previously, SPARK-13894 changed the return type of range from `Dataset[Row]` to `Dataset[Long]`. The problem is that due to https://issues.scala-lang.org/browse/SI-4388, Scala compiles primitive types in generics into just Object, i.e. range at bytecode level now just returns `Dataset[Object]`. This is really bad for Java users because they are losing type safety and also need to add a type cast every time they use range.
      
      Talked to Jason Zaugg from Lightbend (Typesafe) who suggested the best approach is to return `Dataset[java.lang.Long]`. The downside is that when Scala users want to explicitly type a closure used on the dataset returned by range, they would need to use `java.lang.Long` instead of the Scala `Long`.
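      A small sketch of the consequence for Scala users (assuming a Spark shell with `sqlContext` and its implicits imported):

      ```scala
      import sqlContext.implicits._

      val ds = sqlContext.range(5)  // now Dataset[java.lang.Long]
      // An explicitly typed closure must use the boxed type...
      val doubled = ds.map((i: java.lang.Long) => i * 2)
      // ...while closures with inferred parameter types keep working unchanged.
      val evens = ds.filter(_ % 2 == 0)
      ```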
      
      ## How was this patch tested?
      The signature change should be covered by existing unit tests and API tests. I also added a new test case in DatasetSuite for range.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11880 from rxin/SPARK-14063.
      297c2022
    • [SPARK-13985][SQL] Deterministic batches with ids · caea1521
      Michael Armbrust authored
      This PR relaxes the requirements of a `Sink` for structured streaming to only require idempotent appending of data. Previously the `Sink` needed to be able to transactionally append data while recording an opaque offset that indicated how far in the stream we had processed.

      In order to do this, a new write-ahead log has been added to stream execution, which records the offsets that are present in each batch. The log is created in the newly added `checkpointLocation`, which defaults to `${spark.sql.streaming.checkpointLocation}/${queryName}` but can be overridden by setting `checkpointLocation` in `DataFrameWriter`.

      In addition to making sinks easier to write, the addition of batchIds and a checkpoint location is done in anticipation of integration with the `StateStore` (#11645).
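      A minimal sketch of the relaxed contract (illustrative names; see the actual `Sink` trait in the patch for the real signature):

      ```scala
      import org.apache.spark.sql.DataFrame

      trait IdempotentSink {
        // Append the data for a batch. Must be idempotent: if `batchId` has already been
        // committed, the call should be a no-op so that replays after a failure do not
        // duplicate data.
        def addBatch(batchId: Long, data: DataFrame): Unit
      }
      ```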
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #11804 from marmbrus/batchIds.
      caea1521
    • [SPARK-14029][SQL] Improve BooleanSimplification optimization by implementing... · c632bdc0
      Dongjoon Hyun authored
      [SPARK-14029][SQL] Improve BooleanSimplification optimization by implementing `Not` canonicalization.
      
      ## What changes were proposed in this pull request?
      
      Currently, the **BooleanSimplification** optimization can handle the following cases:
      * a && (!a || b ) ==> a && b
      * a && (b || !a ) ==> a && b
      
      However, it cannot handle the following cases, since these expressions fail the comparison between their canonicalized forms:
      * a < 1 && (!(a < 1) || b)     ==> (a < 1) && b
      * a <= 1 && (!(a <= 1) || b) ==> (a <= 1) && b
      * a > 1 && (!(a > 1) || b)     ==> (a > 1) && b
      * a >= 1 && (!(a >= 1) || b) ==> (a >= 1) && b
      
      This PR implements the above cases as well as the following ones (a small sketch of the canonicalization idea follows this list):
      * a < 1 && ((a >= 1) || b )   ==> (a < 1) && b
      * a <= 1 && ((a > 1) || b )   ==> (a <= 1) && b
      * a > 1 && ((a <= 1) || b)  ==> (a > 1) && b
      * a >= 1 && ((a < 1) || b)  ==> (a >= 1) && b
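      A minimal, self-contained sketch of the canonicalization idea on a toy expression type (not the Catalyst code in this patch):

      ```scala
      sealed trait Expr
      case class LessThan(attr: String, value: Int) extends Expr
      case class GreaterThanOrEqual(attr: String, value: Int) extends Expr
      case class Not(child: Expr) extends Expr

      // Rewrite Not(comparison) into the flipped comparison so that, for example,
      // `a < 1` and `Not(a >= 1)` canonicalize to the same expression and compare equal.
      def canonicalizeNot(e: Expr): Expr = e match {
        case Not(LessThan(a, v))           => GreaterThanOrEqual(a, v)
        case Not(GreaterThanOrEqual(a, v)) => LessThan(a, v)
        case other                         => other
      }
      ```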
      
      ## How was this patch tested?
      
      Pass the Jenkins tests, including new test cases in BooleanSimplificationSuite.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11851 from dongjoon-hyun/SPARK-14029.
      c632bdc0
    • [SPARK-13774][SQL] - Improve error message for non-existent paths and add tests · 0ce01635
      Sunitha Kambhampati authored
      SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
      
      **Overview:**
      - If a non-existent path is given in this call
      ```
      scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
      ```
      it throws the following error:
      `java.lang.IllegalArgumentException: Can not create a Path from an empty string` …..
      It gets called from the inferSchema call in `org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`.
      
      -	The purpose of this JIRA is to throw a better error message.
      -	With the fix, you will now get a _Path does not exist_ error message.
      ```
      scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
      org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
        ...
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
        ... 49 elided
      ```
      
      **Details**
      _Changes include:_
      -	Check if path exists or not in resolveRelation in DataSource, and throw an AnalysisException with message like “Path does not exist: $path”
      -	AnalysisException is thrown similar to the exceptions thrown in resolveRelation.
      - The glob path and the non-glob path are checked with minimal path-existence calls. If the globPath is empty, then it is a nonexistent glob pattern and an error will be thrown. If it is not a glob path, it is only necessary to check whether the first element in the Seq is valid.
      
      _Test modifications:_
      -	Changes went in for 3 tests to account for this error checking.
      -	SQLQuerySuite:test("run sql directly on files") – Error message needed to be updated.
      - 2 tests in MetastoreDataSourcesSuite failed because they used a dummy path, so they are modified to use a temp dir and allowed to move past this check so they can continue to test the code path they were meant to test.
      
      _New Tests:_
      2 new tests are added to DataFrameSuite to validate that glob and non-glob path will throw the new error message.
      
      _Testing:_
      Unit tests were run with the fix.
      
      **Notes/Questions to reviewers:**
      -	There is some code duplication in DataSource.scala in resolveRelation method and also createSource with respect to getting the paths.  I have not made any changes to the createSource codepath.  Should we make the change there as well ?
      
      -	From other JIRAs, I know there is restructuring and changes going on in this area, not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it.  If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know.
      
      I would appreciate your review. Thanks for your time and comments.
      
      Author: Sunitha Kambhampati <skambha@us.ibm.com>
      
      Closes #11775 from skambha/improve_errmsg.
      0ce01635
    • [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource · 4e09a0d5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13953
      
      Currently, the JSON data source creates a new field in `PERMISSIVE` mode for storing the malformed string.
      This field can be renamed via the `spark.sql.columnNameOfCorruptRecord` option, but that is a global configuration.

      This PR makes the option applicable per read, specified via `option()`. When set, it overrides `spark.sql.columnNameOfCorruptRecord`.
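      A usage sketch (the file path is hypothetical; `sqlContext` as in a Spark shell):

      ```scala
      val withCorruptColumn = sqlContext.read
        .option("columnNameOfCorruptRecord", "_malformed")  // per-read override of the global setting
        .json("path/to/records.json")
      ```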
      
      ## How was this patch tested?
      
      Unit tests were used, along with `./dev/run_tests` for coding style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11881 from HyukjinKwon/SPARK-13953.
      4e09a0d5
    • [SPARK-13473][SQL] Simplifies PushPredicateThroughProject · f2e855fb
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of PR #11348.
      
      After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, it's impossible that the candidate filter condition can reference any non-deterministic projected fields, and related logic can be safely cleaned up.
      
      To be more specific, the following optimization is allowed:
      
      ```scala
      // From:
      df.select('a, 'b).filter('c > rand(42))
      // To:
      df.filter('c > rand(42)).select('a, 'b)
      ```
      
      while this isn't:
      
      ```scala
      // From:
      df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
      // To:
      df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
      ```
      
      ## How was this patch tested?
      
      Existing test cases should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11864 from liancheng/spark-13473-cleanup.
      f2e855fb
    • [SPARK-14038][SQL] enable native view by default · 14464cad
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      As we have completed the `SQLBuilder`, we can safely turn on native view by default.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11872 from cloud-fan/native-view.
      14464cad
    • [SPARK-14058][PYTHON] Incorrect docstring in Window.order · 8193a266
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces current docstring ("Creates a :class:`WindowSpec` with the partitioning defined.") with "Creates a :class:`WindowSpec` with the ordering defined."
      
      ## How was this patch tested?
      
      PySpark unit tests (no regression introduced). No changes to the code.
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #11877 from zero323/order-by-description.
      8193a266
  3. Mar 21, 2016
    • [SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader · 8014a516
      Michael Armbrust authored
      This PR implements the new `buildReader` interface for the Parquet `FileFormat`. A simple implementation of `FileScanRDD` is also included.
      
      This code should be tested by the many existing tests for parquet.
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Sameer Agarwal <sameer@databricks.com>
      Author: Nong Li <nong@databricks.com>
      
      Closes #11709 from marmbrus/parquetReader.
      8014a516
    • [SPARK-14016][SQL] Support high-precision decimals in vectorized parquet reader · 72999616
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch adds support for reading `DecimalTypes` with high (> 18) precision in `VectorizedColumnReader`
      
      ## How was this patch tested?
      
      1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed.
      2. In particular, the `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `DecimalType(25, 5)`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11869 from sameeragarwal/bigdecimal-parquet.
      72999616
    • [SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error Handling... · 3f49e076
      gatorsmile authored
      [SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star
      
      This PR resolves two issues:
      
      First, expanding * inside aggregate functions of structs when using Dataframe/Dataset APIs. For example,
      ```scala
      structDf.groupBy($"a").agg(min(struct($"record.*")))
      ```
      
      Second, it improves the error messages when having invalid star usage when using Dataframe/Dataset APIs. For example,
      ```scala
      pagecounts4PartitionsDS
        .map(line => (line._1, line._3))
        .toDF()
        .groupBy($"_1")
        .agg(sum("*") as "sumOccurances")
      ```
      Before the fix, the invalid usage will issue a confusing error message, like:
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2;
      ```
      After the fix, the message is like:
      ```
      org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum'
      ```
      cc: rxin nongli cloud-fan
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11208 from gatorsmile/sumDataSetResolution.
      3f49e076
    • [SPARK-13990] Automatically pick serializer when caching RDDs · b5f1ab70
      Josh Rosen authored
      Building on the `SerializerManager` introduced in SPARK-13926 / #11755, this patch modifies Spark's BlockManager to use RDDs' ClassTags in order to select the best serializer to use when caching RDD blocks.
      
      When storing a local block, the BlockManager `put()` methods use implicits to record ClassTags and stores those tags in the blocks' BlockInfo records. When reading a local block, the stored ClassTag is used to pick the appropriate serializer. When a block is stored with replication, the class tag is written into the block transfer metadata and will also be stored in the remote BlockManager.
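      A simplified sketch of the kind of ClassTag-based choice described above (illustrative; the real logic also depends on the configured serializer and handles more cases):

      ```scala
      import scala.reflect.ClassTag

      // Primitive element types are always safe for the faster serializer;
      // anything else falls back to the default serializer unless known to be safe.
      def canUseFastSerializer(ct: ClassTag[_]): Boolean = {
        val primitives: Set[Class[_]] = Set(
          classOf[Boolean], classOf[Byte], classOf[Short], classOf[Int],
          classOf[Long], classOf[Float], classOf[Double], classOf[Char])
        primitives.contains(ct.runtimeClass) || ct.runtimeClass == classOf[String]
      }
      ```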
      
      There are two or three places where we don't properly pass ClassTags, including TorrentBroadcast and BlockRDD. I think this happens to work because the missing ClassTag always happens to be `ClassTag.Any`, but it might be worth looking more carefully at those places to see whether we should be more explicit.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11801 from JoshRosen/pick-best-serializer-for-caching.
      b5f1ab70
    • [SPARK-13898][SQL] Merge DatasetHolder and DataFrameHolder · b3e5af62
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch merges DatasetHolder and DataFrameHolder. This makes more sense because DataFrame/Dataset are now one class.
      
      In addition, fixed some minor issues with pull request #11732.
      
      ## How was this patch tested?
      Updated existing unit tests that test these implicits.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11737 from rxin/SPARK-13898.
      b3e5af62
    • [SPARK-13916][SQL] Add a metric to WholeStageCodegen to measure duration. · 5e86e926
      Nong Li authored
      ## What changes were proposed in this pull request?
      
      WholeStageCodegen naturally breaks the execution into pipelines whose duration is easier to
      measure. This is more granular than the task timings (a task can contain multiple
      pipelines) and is integrated with the web UI.

      We currently report the total time (across all tasks) and min/max/median to get a sense of how long each pipeline takes.
      
      ## How was this patch tested?
      
      Manually tested looking at the web ui.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11741 from nongli/spark-13916.
      5e86e926
    • [SPARK-13019][DOCS] Replace example code in mllib-statistics.md using include_example · 1af8de20
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-13019
      
      The example code in the user guide is embedded in the markdown and hence is not easy to test. It would be nice to test it automatically. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
      
      Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
      `{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}`
      Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` and pick code blocks marked "example" and replace code block in
      `{% highlight %}`
       in the markdown.
      
      See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #11108 from keypointt/SPARK-13019.
      1af8de20