  1. Dec 06, 2016
    • [SPARK-18671][SS][TEST-MAVEN] Follow up PR to fix test for Maven · 3750c6e9
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      Maven compilation does not seem to allow a resource in sql/test to be easily referenced from the kafka-0-10-sql tests, so the kafka-source-offset-version-2.1.0 file was moved from the sql test resources to the kafka-0-10-sql test resources.
      
      ## How was this patch tested?
      
      Manually ran the Maven tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16183 from tdas/SPARK-18671-1.
      
      (cherry picked from commit 5c6bcdbd)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      3750c6e9
    • [SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as formatted... · 9b5bc2a6
      Tathagata Das authored
      [SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as formatted string instead of millis
      
      ## What changes were proposed in this pull request?
      
      Timestamps are easier to read while debugging as formatted strings (in ISO 8601 format) than as milliseconds.
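
      As a rough illustration (not the PR's actual helper), converting a millisecond epoch value into an ISO 8601 string in Scala could look like this; the function name and sample value are made up:

      ```scala
      import java.time.Instant
      import java.time.format.DateTimeFormatter

      // Hypothetical helper: render epoch millis as an ISO 8601 UTC timestamp string.
      def formatTimestamp(millis: Long): String =
        DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(millis))

      formatTimestamp(1480982400000L)  // "2016-12-06T00:00:00Z"
      ```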
      
      ## How was this patch tested?
      Updated unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16166 from tdas/SPARK-18734.
      
      (cherry picked from commit 539bb3cf)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      9b5bc2a6
    • [SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. · 65f5331a
      Shuai Lin authored
      
      ## What changes were proposed in this pull request?
      
      Since we already include the python examples in the pyspark package, we should include the example data with it as well.
      
      We should also include the third-party licenses since we distribute their jars with the pyspark package.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7 and Python 3.4:
      ```sh
      $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
      $ cd python
      $ python setup.py sdist
      $ pip install  dist/pyspark-2.1.0.dev0.tar.gz
      
      $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
      graphx
      mllib
      streaming
      
      $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
      600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/
      
      $ ls -1  /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5
      LICENSE-AnchorJS.txt
      LICENSE-DPark.txt
      LICENSE-Mockito.txt
      LICENSE-SnapTree.txt
      LICENSE-antlr.txt
      ```
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #16082 from lins05/include-data-in-pyspark-dist.
      
      (cherry picked from commit bd9a4a5a)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      65f5331a
    • [SPARK-18671][SS][TEST] Added tests to ensure stability of that all Structured... · d20e0d6b
      Tathagata Das authored
      [SPARK-18671][SS][TEST] Added tests to ensure stability of all Structured Streaming log formats
      
      ## What changes were proposed in this pull request?
      
      To be able to restart StreamingQueries across Spark versions, we have already made the logs (offset log, file source log, file sink log) use JSON. We should add tests with actual JSON files in Spark such that any incompatible change in reading the logs is caught immediately. This PR adds tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.
      
      ## How was this patch tested?
      new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16128 from tdas/SPARK-18671.
      
      (cherry picked from commit 1ef6b296)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      d20e0d6b
    • [SPARK-18714][SQL] Add a simple time function to SparkSession · ace4079c
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      Many Spark developers often want to test the runtime of some function in interactive debugging and testing. This patch adds a simple time function to SparkSession:
      
      ```
      scala> spark.time { spark.range(1000).count() }
      Time taken: 77 ms
      res1: Long = 1000
      ```
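
      For reference, a minimal sketch of what such a timing helper could look like (not necessarily the exact SparkSession implementation):

      ```scala
      // Run a block, print the elapsed wall-clock time, and return the block's result.
      def time[T](f: => T): T = {
        val start = System.nanoTime()
        val result = f
        println(s"Time taken: ${(System.nanoTime() - start) / 1000000} ms")
        result
      }

      time { (1 to 1000).sum }  // prints "Time taken: ... ms" and returns 500500
      ```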
      
      ## How was this patch tested?
      I tested this interactively in spark-shell.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16140 from rxin/SPARK-18714.
      
      (cherry picked from commit cb1f10b4)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      ace4079c
    • [SPARK-18634][SQL][TRIVIAL] Touch-up Generate · e362d998
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      I jumped the gun on merging https://github.com/apache/spark/pull/16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16170 from hvanhovell/SPARK-18634.
      
      (cherry picked from commit 381ef4ea)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      e362d998
  2. Dec 05, 2016
    • [SPARK-18721][SS] Fix ForeachSink with watermark + append · 655297b3
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieve metrics and the watermark.
      
      This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan.
      
      ## How was this patch tested?
      
      `test("foreach with watermark: append")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16160 from zsxwing/SPARK-18721.
      
      (cherry picked from commit 7863c623)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      655297b3
    • [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` · 8ca6a82c
      Michael Allman authored
      (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
      
      ## What changes were proposed in this pull request?
      
      Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
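
      A hedged way to reproduce this kind of measurement (using the `spark.time` helper from SPARK-18714 above; the table name is a placeholder):

      ```scala
      // Time the SHOW PARTITIONS command end to end; with this patch the partition
      // names come from the Hive client's getPartitionNames instead of full metadata.
      spark.time { spark.sql("SHOW PARTITIONS table1").count() }
      ```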
      
      To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
      
      Spark at bdc8153e, `SHOW PARTITIONS table1`, times in seconds:
      7.901
      3.983
      4.018
      4.331
      4.261
      
      Spark at bdc8153e, `SHOW PARTITIONS table2`
      (Timed out after 10 minutes with a `SocketTimeoutException`.)
      
      Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
      3.801
      0.449
      0.395
      0.348
      0.336
      
      Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
      5.184
      1.63
      1.474
      1.519
      1.41
      
      Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
      
      This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
      
      ## How was this patch tested?
      
      I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15998 from mallman/spark-18572-list_partition_names.
      
      (cherry picked from commit 772ddbea)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      8ca6a82c
    • [SPARK-18722][SS] Move no data rate limit from StreamExecution to ProgressReporter · d4588165
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Move no data rate limit from StreamExecution to ProgressReporter to make `recentProgresses` and listener events consistent.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16155 from zsxwing/SPARK-18722.
      
      (cherry picked from commit 4af142f5)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      d4588165
    • [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and... · 1946854a
      Tathagata Das authored
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persist across restarts and not auto-generate StreamingQuery.name
      
      Here are the major changes in this PR.
      - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
      - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
      - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic, and is null by default.
      - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
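
      A hedged usage sketch of the behavior above (the socket source and checkpoint path are illustrative, not taken from the PR):

      ```scala
      // id is recovered from <checkpoint>/metadata across restarts, runId is new for
      // every start, and name stays null unless set explicitly via queryName(...).
      val streamingDf = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()

      val query = streamingDf.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/query-checkpoint")
        .start()

      println(query.id)     // stable across restarts from the same checkpoint
      println(query.runId)  // unique to this particular run
      println(query.name)   // null by default
      ```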
      
      Implementation details
      - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
      - Added the `id` as the new `StreamMetadata`.
      - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
      - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
      
      TODO
      - [x] Test handling of name=null in json generation of StreamingQueryProgress
      - [x] Test handling of name=null in json generation of StreamingQueryListener events
      - [x] Test python API of runId
      
      Updated existing unit tests and added new unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16113 from tdas/SPARK-18657.
      
      (cherry picked from commit bb57bfe9)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      1946854a
    • [SPARK-18729][SS] Move DataFrame.collect out of synchronized block in MemorySink · 6c4c3368
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Move `DataFrame.collect` out of the synchronized block so that we can query the content of MemorySink while `DataFrame.collect` is running.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16162 from zsxwing/SPARK-18729.
      
      (cherry picked from commit 1b2785c3)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      6c4c3368
    • [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs · fecd23d2
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
      
      The following test code can reproduce it. Note: the following test code is reported to return wrong results in the Jira; however, as I tested on the master branch, it causes an exception and so can't return any result.
      
          >>> from pyspark.sql.functions import *
          >>> from pyspark.sql.types import *
          >>>
          >>> df = spark.range(10)
          >>>
          >>> def return_range(value):
          ...   return [(i, str(i)) for i in range(value - 1, value + 1)]
          ...
          >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
          ...                                                     StructField("string_val", StringType())])))
          >>>
          >>> df.select("id", explode(range_udf(df.id))).show()
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
              print(self._jdf.showString(n, 20))
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
            File "/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
              at scala.Predef$.assert(Predef.scala:156)
              at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
              at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
      
      The cause of this issue is that in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run Python UDFs in batch. `BatchEvalPythonExec` adds extra outputs (e.g., `pythonUDF0`) to the original plan. In the above case, the original `Range` only has one output, `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs, `id` and `pythonUDF0`.

      Because the output of `GenerateExec` is determined in the analysis phase, in the above case it is the combination of `id` (i.e., the output of `Range`) and `col`. But in the planning phase we change `GenerateExec`'s child plan to `BatchEvalPythonExec`, which has additional output attributes.

      This causes no problem without whole-stage codegen, because during evaluation the additional attributes are projected out of the final output of `GenerateExec`.

      However, as `GenerateExec` now supports whole-stage codegen, the framework feeds all the outputs of the child plan into `GenerateExec`. Then, when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes differs from the number of output variables in whole-stage codegen.

      To solve this issue, this patch gives only the generator's output to `GenerateExec` after the analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output, so when we change `GenerateExec`'s child, its output is still correct.
      
      ## How was this patch tested?
      
      Added test cases to PySpark.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16120 from viirya/fix-py-udf-with-generator.
      
      (cherry picked from commit 3ba69b64)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      fecd23d2
    • [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · c6a4e3d9
      Shixiong Zhu authored
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1)
      
      ## What changes were proposed in this pull request?
      
      Backport #16125 to branch 2.1.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16153 from zsxwing/SPARK-18694-2.1.
      c6a4e3d9
    • [DOCS][MINOR] Update location of Spark YARN shuffle jar · 39759ff0
      Nicholas Chammas authored
      
      Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`.
      
      This change is so minor I'm not sure it needs a JIRA. But let me know if so and I'll create one.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #16130 from nchammas/yarn-doc-fix.
      
      (cherry picked from commit 5a92dc76)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      39759ff0
    • [SPARK-18711][SQL] should disable subexpression elimination for LambdaVariable · e23c8cfc
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is kind of a long-standing bug; it was hidden until https://github.com/apache/spark/pull/15780, which may add `AssertNotNull` on top of `LambdaVariable` and thus enables subexpression elimination.
      
      However, subexpression elimination will evaluate the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents a loop variable, which can't be evaluated ahead of the loop.
      
      This PR skips expressions containing `LambdaVariable` when doing subexpression elimination.
      
      ## How was this patch tested?
      
      updated test in `DatasetAggregatorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16143 from cloud-fan/aggregator.
      
      (cherry picked from commit 01a7d33d)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      e23c8cfc
    • Revert "[SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise" · 30c07430
      Reynold Xin authored
      This reverts commit fce1be6c from branch-2.1.
      30c07430
    • [MINOR][DOC] Use SparkR `TRUE` value and add default values for `StructField` in SQL Guide. · afd2321b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds default values of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In Java API, `nullable` is not optional.
      
      **BEFORE**

      * SPARK 2.1.0 RC1: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types

      **AFTER**

      * R (screenshot): https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png
      * Scala (screenshot): https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png
      * Python (screenshot): https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png
      
      ## How was this patch tested?
      
      Manual.
      
      ```
      cd docs
      SKIP_API=1 jekyll build
      open _site/index.html
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE.
      
      (cherry picked from commit 410b7898)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      afd2321b
    • [SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide. · 1821cbea
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Add R examples to ML programming guide for the following algorithms as POC:
      * spark.glm
      * spark.survreg
      * spark.naiveBayes
      * spark.kmeans
      
      The four algorithms have been available in SparkR since 2.0.0; more docs for algorithms added during the 2.1 release cycle will be addressed in a separate follow-up PR.
      
      ## How was this patch tested?
      These are screenshots of the generated ML programming guide for ```GeneralizedLinearRegression```:
      ![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png)
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16136 from yanboliang/spark-18279.
      
      (cherry picked from commit eb8dd681)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      1821cbea
    • [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol · 88e07efe
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
      add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`
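
      A hedged usage sketch of the new setters (toy data; the column names are examples, not the PR's test code):

      ```scala
      import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
      import org.apache.spark.ml.linalg.Vectors

      // Tiny three-class training set.
      val training = spark.createDataFrame(Seq(
        (0.0, Vectors.dense(0.0, 1.1)),
        (1.0, Vectors.dense(2.0, 1.0)),
        (2.0, Vectors.dense(2.0, -1.0))
      )).toDF("label", "features")

      val model = new OneVsRest()
        .setClassifier(new LogisticRegression())
        .fit(training)

      // New in this change: re-point the fitted model's input/output columns.
      model.setFeaturesCol("features").setPredictionCol("myPrediction")
      model.transform(training).select("label", "myPrediction").show()
      ```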
      
      ## How was this patch tested?
      added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16059 from zhengruifeng/ovrm_setCol.
      
      (cherry picked from commit bdfe7f67)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      88e07efe
  3. Dec 04, 2016
    • [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark · c13c2939
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      If SparkR is running as a package and it has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. With this bug, the auto-installed Spark only works in the first session.
      
      This seems to be a regression on the earlier behavior.
      
      The fix is to always try to install or check for the cached Spark if running in an interactive session.
      As discussed before, we should probably only install Spark if running in an interactive session (R shell, RStudio, etc.).
      
      ## How was this patch tested?
      
      Manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16077 from felixcheung/rsessioninteractive.
      
      (cherry picked from commit b019b3a8)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      c13c2939
    • [SPARK-18661][SQL] Creating a partitioned datasource table should not scan all files for table · 41d698ec
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason.
      
      We should avoid doing this when the user specifies a schema.
      
      ## How was this patch tested?
      
      Perf stat tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16090 from ericl/spark-18661.
      
      (cherry picked from commit d9eb4c72)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      41d698ec
    • [SPARK-18091][SQL] Deep if expressions cause Generated... · 8145c82b
      Kapil Singh authored
      [SPARK-18091][SQL] Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit
      
      ## What changes were proposed in this pull request?
      
      Fix for SPARK-18091, a bug where large if expressions cause the generated SpecificUnsafeProjection code to exceed the JVM code size limit.

      This PR changes if expression code generation to place the predicate, true-value, and false-value expressions' generated code in separate methods in the codegen context, so that the combined code never grows too long.
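
      A hypothetical way to construct the kind of deeply nested if expression that could previously blow past the JVM method size limit (the column and label names are made up):

      ```scala
      import org.apache.spark.sql.functions._

      // A long when/otherwise chain compiles down to deeply nested if/else in the
      // generated SpecificUnsafeProjection code.
      val deepCase = (1 to 500).foldLeft(lit("other")) { (acc, i) =>
        when(col("id") === i, lit(s"bucket_$i")).otherwise(acc)
      }
      spark.range(1000).toDF("id").select(deepCase.as("label")).show()
      ```
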
      ## How was this patch tested?
      
      Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place.
      
      Author: Kapil Singh <kapsingh@adobe.com>
      
      Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.
      
      (cherry picked from commit e463678b)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      8145c82b
  4. Dec 03, 2016
    • [SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) · 28f698b4
      Yunni authored
      
      ## What changes were proposed in this pull request?
      The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.
      
      ## How was this patch tested?
      Doc has been generated through Jekyll, and checked through manual inspection.
      
      Author: Yunni <Euler57721@gmail.com>
      Author: Yun Ni <yunn@uber.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Yun Ni <Euler57721@gmail.com>
      
      Closes #15795 from Yunni/SPARK-18081-lsh-guide.
      
      (cherry picked from commit 34777184)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      28f698b4
    • [SPARK-18582][SQL] Whitelist LogicalPlan operators allowed in correlated subqueries · b098b484
      Nattavut Sutyanyong authored
      
      ## What changes were proposed in this pull request?
      
      This fix puts an explicit list of operators that Spark supports for correlated subqueries.
      
      ## How was this patch tested?
      
      Run sql/test, catalyst/test and add a new test case on Generate.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16046 from nsyca/spark18455.0.
      
      (cherry picked from commit 4a3c0960)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      b098b484
    • [SPARK-18685][TESTS] Fix URI and release resources after opening in tests at... · 28ea432a
      hyukjinkwon authored
      [SPARK-18685][TESTS] Fix URI and release resources after opening in tests at ExecutorClassLoaderSuite
      
      ## What changes were proposed in this pull request?
      
      This PR fixes two problems as below:
      
      - Close `BufferedSource` after `Source.fromInputStream(...)` to release resource and make the tests pass on Windows in `ExecutorClassLoaderSuite`
      
        ```
        [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 333 milliseconds)
        [info]   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
        [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
        [info]   at org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
        [info]   at org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
        ...
        ```
      
      - Fix URI correctly so that related tests can be passed on Windows.
      
        ```
        [info] - child first *** FAILED *** (78 milliseconds)
        [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
        [info]   at java.net.URI$Parser.fail(URI.java:2848)
        [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
        ...
        [info] - parent first *** FAILED *** (15 milliseconds)
        [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
        [info]   at java.net.URI$Parser.fail(URI.java:2848)
        [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
        ...
        [info] - child first can fall back *** FAILED *** (0 milliseconds)
        [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
        [info]   at java.net.URI$Parser.fail(URI.java:2848)
        [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
        ...
        [info] - child first can fail *** FAILED *** (0 milliseconds)
        [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
        [info]   at java.net.URI$Parser.fail(URI.java:2848)
        [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
        ...
        [info] - resource from parent *** FAILED *** (0 milliseconds)
        [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
        [info]   at java.net.URI$Parser.fail(URI.java:2848)
        [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
        ...
        [info] - resources from parent *** FAILED *** (0 milliseconds)
        [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
        [info]   at java.net.URI$Parser.fail(URI.java:2848)
        [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
        ```
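
      A hedged sketch of the two kinds of fixes described above (the paths and the stream here are illustrative, not the suite's actual code):

      ```scala
      import java.io.{ByteArrayInputStream, File}
      import scala.io.Source

      // 1. Close the BufferedSource after reading so the temp directory can be deleted.
      val source = Source.fromInputStream(new ByteArrayInputStream("hello".getBytes("UTF-8")))
      val contents = try source.mkString finally source.close()

      // 2. Build file URIs via File.toURI rather than string concatenation, so that
      //    Windows paths such as C:\projects\... are escaped into a legal URI.
      val dir = new File("target/tmp/spark-test")
      val uri = dir.toURI.toString  // e.g. file:/C:/projects/spark/target/tmp/spark-test
      ```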
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      **Before**
      https://ci.appveyor.com/project/spark-test/spark/build/102-rpel-ExecutorClassLoaderSuite
      
      **After**
      https://ci.appveyor.com/project/spark-test/spark/build/108-rpel-ExecutorClassLoaderSuite
      
      
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16116 from HyukjinKwon/close-after-open.
      
      (cherry picked from commit d1312fb7)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      28ea432a
  5. Dec 02, 2016
    • [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames · cf3dbec6
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Makes `Window.unboundedPreceding` and `Window.unboundedFollowing` backward compatible.
      
      ## How was this patch tested?
      
      Pyspark SQL unittests.
      
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16123 from zero323/SPARK-17845-follow-up.
      
      (cherry picked from commit a9cbfc4f)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      cf3dbec6
    • [SPARK-18324][ML][DOC] Update ML programming and migration guide for 2.1 release · 839d4e9c
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      Update ML programming and migration guide for 2.1 release.
      
      ## How was this patch tested?
      Doc change, no test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16076 from yanboliang/spark-18324.
      
      (cherry picked from commit 2dc0d7ef)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      839d4e9c
    • [SPARK-18670][SS] Limit the number of... · f5376327
      Shixiong Zhu authored
      [SPARK-18670][SS] Limit the number of StreamingQueryListener.StreamProgressEvent when there is no data
      
      ## What changes were proposed in this pull request?
      
      This PR adds a SQL conf `spark.sql.streaming.noDataReportInterval` to control how long to wait before outputting the next StreamProgressEvent when there is no data.
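
      A hedged usage sketch (the "30s" value is just an example): raise the interval so that idle queries report progress less often.

      ```scala
      spark.conf.set("spark.sql.streaming.noDataReportInterval", "30s")
      ```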
      
      ## How was this patch tested?
      
      The added unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16108 from zsxwing/SPARK-18670.
      
      (cherry picked from commit 56a503df)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      f5376327
    • [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict... · f915f812
      Yanbo Liang authored
      [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial."
      
      ## What changes were proposed in this pull request?
      It's better to fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, so they could output probabilities, log-space predictions, or original labels. In order not to introduce a breaking API change in 2.1, we revert this change first and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.
      
      ## How was this patch tested?
      Existing unit tests.
      
      This reverts commit daa975f4.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16118 from yanboliang/spark-18291-revert.
      
      (cherry picked from commit a985dd8e)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      f915f812
    • [SPARK-18677] Fix parsing ['key'] in JSON path expressions. · c69825a9
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      This fixes the parser rule that matches named expressions, which didn't work for two reasons:
      1. The name match is not coerced to a regular expression (missing .r)
      2. The surrounding literals are incorrect and attempt to escape a single quote, which is unnecessary
      
      ## How was this patch tested?
      
      This adds test cases for named expressions using the bracket syntax, including one with quoted spaces.
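
      A hedged example of the bracket syntax the fixed rule has to parse, including a quoted key containing spaces (the data and key names are made up):

      ```scala
      import org.apache.spark.sql.functions.get_json_object
      import spark.implicits._

      val df = Seq("""{"item": "pen", "price and tax": 1.2}""").toDF("json")
      df.select(
        get_json_object($"json", "$['item']"),
        get_json_object($"json", "$['price and tax']")
      ).show()
      ```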
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #16107 from rdblue/SPARK-18677-fix-json-path.
      
      (cherry picked from commit 48778976)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      c69825a9
    • [SPARK-18674][SQL][FOLLOW-UP] improve the error message of using join · 32c85383
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      Added a test case for using joins with nested fields.
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16110 from gatorsmile/followup-18674.
      
      (cherry picked from commit 2f8776cc)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      32c85383
    • [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource tables · e374b242
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Two bugs are addressed here
      1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition management was enabled. This was because, when dropping partitions after an overwrite operation, the Hive client would attempt to delete the partition files; if the entire partition directory had already been dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
      2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names.
      
      cc yhuai cloud-fan
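
      A hedged illustration of the second issue above (table name and data are made up): a static partition spec written with different capitalization must resolve to the same partition instead of clobbering the whole table.

      ```scala
      spark.sql("CREATE TABLE t (a INT, p INT) USING parquet PARTITIONED BY (p)")
      spark.sql("INSERT INTO t SELECT 1, 1")
      spark.sql("INSERT INTO t SELECT 2, 2")
      // "P" vs. "p": before the fix this could overwrite the entire table.
      spark.sql("INSERT OVERWRITE TABLE t PARTITION (P = 1) SELECT 10")
      spark.sql("SELECT * FROM t ORDER BY p").show()  // partition p=2 should survive
      ```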
      
      ## How was this patch tested?
      
      Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16088 from ericl/spark-18659.
      
      (cherry picked from commit 7935c847)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      e374b242
    • [SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options · 415730e1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `JDBCRelation.insert` removes Spark options too early by mistakenly using `asConnectionProperties`. Spark options like `numPartitions` should be passed into `DataFrameWriter.jdbc` correctly. This bug has been **hidden** because `JDBCOptions.asConnectionProperties` fails to filter out mixed-case options. This PR aims to fix both issues.
      
      **JDBCRelation.insert**
      ```scala
      override def insert(data: DataFrame, overwrite: Boolean): Unit = {
        val url = jdbcOptions.url
        val table = jdbcOptions.table
      - val properties = jdbcOptions.asConnectionProperties
      + val properties = jdbcOptions.asProperties
        data.write
          .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
          .jdbc(url, table, properties)
      ```
      
      **JDBCOptions.asConnectionProperties**
      ```scala
      scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
      scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
      scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
      res0: java.util.Properties = {numpartitions=10}
      scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
      res1: java.util.Properties = {numpartitions=10}
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15863 from dongjoon-hyun/SPARK-18419.
      
      (cherry picked from commit 55d528f2)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      415730e1
    • [SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables · 65e896a6
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.
      
      This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).
      
      cc mallman  cloud-fan
      
      ## How was this patch tested?
      
      Checked metrics in unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16112 from ericl/spark-18679.
      
      (cherry picked from commit 294163ee)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      65e896a6
    • [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary... · a7f8ebb8
      Cheng Lian authored
      [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686
      
      This PR targets both master and branch-2.1.
      
      ## What changes were proposed in this pull request?
      
      Due to PARQUET-686, Parquet doesn't do string comparison correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around this issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as a plain Parquet `binary` instead of a `binary (UTF8)`.
      
      ## How was this patch tested?
      
      New test case added in `ParquetFilterSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #16106 from liancheng/spark-17213-bad-string-ppd.
      
      (cherry picked from commit ca639163)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      a7f8ebb8
  6. Dec 01, 2016
    • [SPARK-18647][SQL] do not put provider in table properties for Hive serde table · 0f0903d1
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      In Spark 2.1, we make Hive serde tables case-preserving by putting the table metadata in table properties, like what we did for data source tables. However, we should not put the table provider there, as it will break forward compatibility. For example, if we create a Hive serde table with Spark 2.1 using `sql("create table test stored as parquet as select 1")`, we will fail to read it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table because there is a `provider` entry in the table properties.
      
      Logically, a Hive serde table's provider is always hive, so we don't need to store it in the table properties; this PR removes it.
      
      ## How was this patch tested?
      
      manually test the forward compatibility issue.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16080 from cloud-fan/hive.
      
      (cherry picked from commit a5f02b00)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      0f0903d1
    • [SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise · fce1be6c
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR makes `ExpressionEncoder.serializer.nullable` `false` for the flat encoder of a primitive type. Since it is `true` for now, it is too conservative.
      While `ExpressionEncoder.schema` has the correct information (e.g. `<IntegerType, false>`), `serializer.head.nullable` of the `ExpressionEncoder` obtained from `encoderFor[T]` is always `true`, which is too conservative.

      This is accomplished by checking whether the type is one of the primitive types; if so, `nullable` is set to `false`.
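
      An illustration of the mismatch described above, before this change (this uses an internal API and is shown only to make the claim concrete):

      ```scala
      import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

      val enc = ExpressionEncoder[Int]()
      enc.schema                    // StructType(StructField(value, IntegerType, false))
      enc.serializer.head.nullable  // was true before this patch; false for primitives afterwards
      ```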
      
      ## How was this patch tested?
      
      Added new tests for encoder and dataframe
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15780 from kiszk/SPARK-18284.
      
      (cherry picked from commit 38b9e696)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      fce1be6c
    • [SPARK-18538][SQL][BACKPORT-2.1] Fix Concurrent Table Fetching Using DataFrameReader JDBC APIs · b9eb1004
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      #### This PR is to backport https://github.com/apache/spark/pull/15975 to Branch 2.1
      
      ---
      
      The following two `DataFrameReader` JDBC APIs ignore the user-specified parallelism parameters.
      
      ```Scala
        def jdbc(
            url: String,
            table: String,
            columnName: String,
            lowerBound: Long,
            upperBound: Long,
            numPartitions: Int,
            connectionProperties: Properties): DataFrame
      ```
      
      ```Scala
        def jdbc(
            url: String,
            table: String,
            predicates: Array[String],
            connectionProperties: Properties): DataFrame
      ```
      
      This PR fixes these issues. To verify the correctness of the behavior, we improve the plan output of the `EXPLAIN` command by adding `numPartitions` to the `JDBCRelation` node.
      
      Before the fix,
      ```
      == Physical Plan ==
      *Scan JDBCRelation(TEST.PEOPLE) [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
      ```
      
      After the fix,
      ```
      == Physical Plan ==
      *Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
      ```
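
      A hedged usage sketch: after the fix, the partitioning arguments passed to the column-based `jdbc` API are honored and `numPartitions` shows up in the plan (the URL, table, and bounds are placeholders).

      ```scala
      val props = new java.util.Properties()
      // jdbc(url, table, columnName, lowerBound, upperBound, numPartitions, connectionProperties)
      val df = spark.read.jdbc(
        "jdbc:h2:mem:testdb", "TEST.PEOPLE", "THEID",
        1L, 4L, 3, props)
      df.explain()  // expect ... JDBCRelation(TEST.PEOPLE) [numPartitions=3] ...
      ```
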
      ### How was this patch tested?
      Added verification logic to all the test cases for JDBC concurrent fetching.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16111 from gatorsmile/jdbcFix2.1.
      b9eb1004
    • [SPARK-18141][SQL] Fix to quote column names in the predicate clause of the... · 2f91b015
      sureshthalamati authored
      [SPARK-18141][SQL] Fix to quote column names in the predicate clause  of the JDBC RDD generated sql statement
      
      ## What changes were proposed in this pull request?
      
      The SQL query generated for the JDBC data source does not quote columns in the predicate clause. When the source table has quoted column names, a Spark JDBC read incorrectly fails with a column-not-found error.
      
      Error:
      org.h2.jdbc.JdbcSQLException: Column "ID" not found;
      Source SQL statement:
      SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE (Id < 1)
      
      This PR fixes the issue by quoting column names in the predicate clause of the generated SQL when filters are pushed down to the data source.
      
      Source SQL statement after the fix:
      SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE ("Id" < 1)
      
      ## How was this patch tested?
      
      Added new test case to the JdbcSuite
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #15662 from sureshthalamati/filter_quoted_cols-SPARK-18141.
      
      (cherry picked from commit 70c5549e)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      2f91b015
    • [SPARK-18639] Build only a single pip package · 2d2e8018
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      We currently build 5 separate pip binary tarballs, doubling the release script runtime. It'd be better to build one, especially for use cases that are just using Spark locally. In the long run, it would make more sense to have Hadoop support be pluggable.
      
      ## How was this patch tested?
      N/A - this is a release build script that doesn't have any automated test coverage. We will know if it goes wrong when we prepare releases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16072 from rxin/SPARK-18639.
      
      (cherry picked from commit 37e52f87)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      2d2e8018