  1. Jan 18, 2017
    • Adam Roberts's avatar
      [SPARK-18782][BUILD] Bump Hadoop 2.6 version to use Hadoop 2.6.5 · 17ce0b5b
      Adam Roberts authored
      **What changes were proposed in this pull request?**
      
      Use Hadoop 2.6.5 for the Hadoop 2.6 profile; the release notes list a number of fixes, including security fixes, that we should pick up.
      
      **How was this patch tested?**
      
      Running the unit tests with IBM's SDK for Java; the community builder will cover OpenJDK. No trouble is expected since this is only a minor release.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #16616 from a-roberts/Hadoop265Bumper.
      17ce0b5b
    • uncleGen's avatar
      [SPARK-19227][SPARK-19251] remove unused imports and outdated comments · eefdf9f9
      uncleGen authored
      ## What changes were proposed in this pull request?
      Remove unused imports and outdated comments, and fix some minor code style issues.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16591 from uncleGen/SPARK-19227.
      eefdf9f9
    • Wenchen Fan's avatar
      [SPARK-18243][SQL] Port Hive writing to use FileFormat interface · 4494cd97
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Inserting data into Hive tables has its own implementation that is distinct from data sources: `InsertIntoHiveTable`, `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`.
      
      Note that one other major difference is that data source tables write directly to the final destination without using a staging directory, and then Spark itself adds the partitions/tables to the catalog. Hive tables actually write to a staging directory and then call the Hive metastore's loadPartition/loadTable functions to load the data in. So we still need to keep `InsertIntoHiveTable` to hold this special logic. In the future, we should consider writing to the Hive table location directly, so that we don't need to call `loadTable`/`loadPartition` at the end and can remove `InsertIntoHiveTable`.
      
      This PR removes `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`, and creates a `HiveFileFormat` to implement the write logic. In the future, we should also implement the read logic in `HiveFileFormat`.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16517 from cloud-fan/insert-hive.
      4494cd97
  2. Jan 17, 2017
    • Zheng RuiFeng's avatar
      [SPARK-18206][ML] Add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR · e7f982b2
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Add instrumentation for MLP, NB, LDA, AFT, GLM, Isotonic and LiR.

      ## How was this patch tested?
      
      local test in spark-shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #15671 from zhengruifeng/lir_instr.
      e7f982b2
    • Bogdan Raducanu's avatar
      [SPARK-13721][SQL] Support outer generators in DataFrame API · 2992a0e7
      Bogdan Raducanu authored
      ## What changes were proposed in this pull request?
      
      Added outer_explode, outer_posexplode, outer_inline functions and expressions.
      Also fixed a bug in GenerateExec.scala for CollectionGenerator: the outer case was previously handled correctly only for nulls, not for empty collections.
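
      A minimal sketch of the intended semantics, assuming the outer variants are exposed roughly under the names listed above (the exact function names in the final API may differ):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._

      val spark = SparkSession.builder().master("local[*]").appName("outer-generators").getOrCreate()
      import spark.implicits._

      // Hypothetical data: one row with a non-empty array, one with an empty array.
      val df = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "items")

      // A plain explode drops row 2 entirely: nothing is generated for an empty array.
      df.select($"id", explode($"items")).show()

      // The outer variant described above keeps row 2 and emits null for the generated
      // column instead of dropping the row (function name taken from the description):
      // df.select($"id", outer_explode($"items")).show()
      ```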
      
      ## How was this patch tested?
      
      New tests added to GeneratorFunctionSuite
      
      Author: Bogdan Raducanu <bogdan.rdc@gmail.com>
      
      Closes #16608 from bogdanrdc/SPARK-13721.
      2992a0e7
    • Reynold Xin's avatar
      [SPARK-18917][SQL] Remove schema check in appending data · 83dff87d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In append mode, we check whether the schema of the write is compatible with the schema of the existing data. Discovering the existing schema from files can be a significant performance issue in cloud environments. This patch removes the check.
      
      Note that for catalog tables, we always do the check, as discussed in https://github.com/apache/spark/pull/16339#discussion_r96208357
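
      A hedged sketch of the write path affected; the output path is hypothetical:

      ```scala
      // Run in spark-shell, where `spark` is the predefined SparkSession.
      // With this change, appending to a path-based data source table no longer lists
      // the existing files to re-derive and compare their schema first.
      // Catalog tables still get the schema check, as noted above.
      spark.range(100).toDF("id")
        .write
        .mode("append")
        .parquet("s3a://my-bucket/events")
      ```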
      
      ## How was this patch tested?
      N/A
      
      Closes #16339.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16622 from rxin/SPARK-18917.
      83dff87d
    • jiangxingbo's avatar
      [MINOR][SQL] Remove duplicate call of reset() function in CurrentOrigin.withOrigin() · fee20df1
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      Remove duplicate call of reset() function in CurrentOrigin.withOrigin().
      
      ## How was this patch tested?
      
      Existing test cases.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16615 from jiangxb1987/dummy-code.
      fee20df1
    • DjvuLee's avatar
      [SPARK-19239][PYSPARK] Check parameters whether equals None when specify the column in jdbc API · 843ec8ec
      DjvuLee authored
      ## What changes were proposed in this pull request?
      
      The `jdbc` API does not check `lowerBound` and `upperBound` when `column` is
      specified, and just throws the following exception:

      > `int() argument must be a string or a number, not 'NoneType'`
      
      If we check the parameter, we can give a more friendly suggestion.
      
      ## How was this patch tested?
      Tested using the pyspark shell, without the lowerBound and upperBound parameters.
      
      Author: DjvuLee <lihu@bytedance.com>
      
      Closes #16599 from djvulee/pysparkFix.
      843ec8ec
    • gatorsmile's avatar
      [SPARK-19129][SQL] SessionCatalog: Disallow empty part col values in partition spec · a23debd7
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Empty partition column values are not valid for a partition specification. Before this PR, we allowed users to specify them, and the Hive metastore does not detect and disallow them either. Thus, users hit the following surprising behavior.
      
      ```Scala
      val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
      df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
      spark.sql("alter table partitionedTable drop partition(partCol1='')")
      spark.table("partitionedTable").show()
      ```
      
      In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with empty values.
      
      When there is more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also unexpected and does not follow actual Hive behavior. This PR disallows users from specifying such an invalid partition spec in the `SessionCatalog` APIs.
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16583 from gatorsmile/disallowEmptyPartColValue.
      a23debd7
    • Shixiong Zhu's avatar
      [SPARK-19065][SQL] Don't inherit expression id in dropDuplicates · a83accfc
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.
      
      ## How was this patch tested?
      
      test("SPARK-19065: dropDuplicates should not create expressions using the same id")
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16564 from zsxwing/SPARK-19065.
      a83accfc
    • hyukjinkwon's avatar
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port... · 20e62806
      hyukjinkwon authored
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0
      
      ## What changes were proposed in this pull request?
      
      Currently, PySpark does not work with Python 3.6.0.
      
      Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
      
      ```
      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      ```
      
      The root cause seems to be that some arguments of `namedtuple` became keyword-only arguments as of Python 3.6.0 (see https://bugs.python.org/issue25628).

      We currently copy this function via `types.FunctionType`, which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`), and this seems to cause missing values internally in the function (unbound arguments).

      This PR proposes to work around this by manually setting the defaults via `kwargs`, as `types.FunctionType` does not seem to support setting them.
      
      Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7.6 and Python 3.6.0.
      
      ```
      ./bin/pyspark
      ```
      
      as well as manual creation of `namedtuple` both locally and in an RDD with Python 3.6.0,
      
      and Jenkins tests for other Python versions.
      
      Also,
      
      ```
      ./run-tests --python-executables=python3.6
      ```
      
      ```
      Will test against the following Python executables: ['python3.6']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Finished test(python3.6): pyspark.sql.tests (192s)
      Finished test(python3.6): pyspark.accumulators (3s)
      Finished test(python3.6): pyspark.mllib.tests (198s)
      Finished test(python3.6): pyspark.broadcast (3s)
      Finished test(python3.6): pyspark.conf (2s)
      Finished test(python3.6): pyspark.context (14s)
      Finished test(python3.6): pyspark.ml.classification (21s)
      Finished test(python3.6): pyspark.ml.evaluation (11s)
      Finished test(python3.6): pyspark.ml.clustering (20s)
      Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.streaming.tests (240s)
      Finished test(python3.6): pyspark.tests (240s)
      Finished test(python3.6): pyspark.ml.recommendation (19s)
      Finished test(python3.6): pyspark.ml.feature (36s)
      Finished test(python3.6): pyspark.ml.regression (37s)
      Finished test(python3.6): pyspark.ml.tuning (28s)
      Finished test(python3.6): pyspark.mllib.classification (26s)
      Finished test(python3.6): pyspark.mllib.evaluation (18s)
      Finished test(python3.6): pyspark.mllib.clustering (44s)
      Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.mllib.feature (26s)
      Finished test(python3.6): pyspark.mllib.fpm (23s)
      Finished test(python3.6): pyspark.mllib.random (8s)
      Finished test(python3.6): pyspark.ml.tests (92s)
      Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
      Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
      Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
      Finished test(python3.6): pyspark.mllib.recommendation (24s)
      Finished test(python3.6): pyspark.mllib.regression (26s)
      Finished test(python3.6): pyspark.profiler (9s)
      Finished test(python3.6): pyspark.mllib.tree (16s)
      Finished test(python3.6): pyspark.shuffle (1s)
      Finished test(python3.6): pyspark.mllib.util (18s)
      Finished test(python3.6): pyspark.serializers (11s)
      Finished test(python3.6): pyspark.rdd (20s)
      Finished test(python3.6): pyspark.sql.conf (8s)
      Finished test(python3.6): pyspark.sql.catalog (17s)
      Finished test(python3.6): pyspark.sql.column (18s)
      Finished test(python3.6): pyspark.sql.context (18s)
      Finished test(python3.6): pyspark.sql.group (27s)
      Finished test(python3.6): pyspark.sql.dataframe (33s)
      Finished test(python3.6): pyspark.sql.functions (35s)
      Finished test(python3.6): pyspark.sql.types (6s)
      Finished test(python3.6): pyspark.sql.streaming (13s)
      Finished test(python3.6): pyspark.streaming.util (0s)
      Finished test(python3.6): pyspark.sql.session (16s)
      Finished test(python3.6): pyspark.sql.window (4s)
      Finished test(python3.6): pyspark.sql.readwriter (35s)
      Tests passed in 433 seconds
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16429 from HyukjinKwon/SPARK-19019.
      20e62806
    • jerryshao's avatar
      [SPARK-19179][YARN] Change spark.yarn.access.namenodes config and update docs · b79cc7ce
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The `spark.yarn.access.namenodes` configuration name does not actually reflect its usage: in the code it refers to the Hadoop filesystems we obtain tokens for, not NameNodes. So here we propose to update the name of this configuration and also change the related code and docs.
      
      ## How was this patch tested?
      
      Local verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16560 from jerryshao/SPARK-19179.
      b79cc7ce
    • hyukjinkwon's avatar
      [SPARK-3249][DOC] Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc` · 6c00c069
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix ambiguous link warnings by simply turning the links into code blocks for both javadoc and scaladoc.
      
      ```
      [warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The link target "SparkContext#accumulator" is ambiguous. Several members fit the target:
      [warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281: The link target "runMiniBatchSGD" is ambiguous. Several members fit the target:
      [warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83: The link target "run" is ambiguous. Several members fit the target:
      ...
      ```
      
      This PR also fixes javadoc8 break as below:
      
      ```
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
      [error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
      [error]                                                   ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
      [error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
      [error]                                                                                ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
      [error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
      [error]                                                                                                ^
      [info] 3 errors
      ```
      
      ## How was this patch tested?
      
      Manually via `sbt unidoc > output.txt`, then checked via `cat output.txt | grep ambiguous` and `sbt unidoc | grep error`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16604 from HyukjinKwon/SPARK-3249.
      6c00c069
    • Nick Lavers's avatar
      [SPARK-19219][SQL] Fix Parquet log output defaults · 0019005a
      Nick Lavers authored
      ## What changes were proposed in this pull request?
      
      Changing the default parquet logging levels to reflect the changes made in PR [#15538](https://github.com/apache/spark/pull/15538), in order to prevent the flood of log messages by default.
      
      ## How was this patch tested?
      
      Default log output when reading from parquet 1.6 files was compared with and without this change. The change eliminates the extraneous logging and makes the output readable.
      
      Author: Nick Lavers <nick.lavers@videoamp.com>
      
      Closes #16580 from nicklavers/spark-19219-set_default_parquet_log_level.
      0019005a
    • Wenchen Fan's avatar
      [SPARK-19240][SQL][TEST] add test for setting location for managed table · a774bca0
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      SET LOCATION also works on managed tables (i.e., tables created without a custom path). The behavior is a little odd, but since we already support it, we should add a test to explicitly show the behavior.
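
      A hedged sketch of the behavior being tested; the table name and location are hypothetical:

      ```scala
      // Run in spark-shell, where `spark` is the predefined SparkSession.
      spark.sql("CREATE TABLE managed_tbl (id INT) USING parquet")           // no custom path: a managed table
      spark.sql("ALTER TABLE managed_tbl SET LOCATION '/tmp/custom_loc'")    // already supported, if a bit odd
      spark.sql("DESCRIBE EXTENDED managed_tbl").show(100, truncate = false) // Location now points at the new path
      ```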
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16597 from cloud-fan/set-location.
      a774bca0
    • Yanbo Liang's avatar
      [MINOR][YARN] Move YarnSchedulerBackendSuite to resource-managers/yarn directory. · 84f0b645
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #16092 moves YARN resource manager related code to the resource-managers/yarn directory. The test case `YarnSchedulerBackendSuite` was added after that, but in the wrong place. This PR moves it to the correct directory.
      
      ## How was this patch tested?
      Existing test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16595 from yanboliang/yarn.
      84f0b645
  3. Jan 16, 2017
    • Wenchen Fan's avatar
      [SPARK-19148][SQL] do not expose the external table concept in Catalog · 18ee55dd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/16296 , we reached a consensus that we should hide the external/managed table concept from users and only expose custom table paths.

      This PR renames `Catalog.createExternalTable` to `createTable` (the old versions are kept for backward compatibility), and only sets the table type to EXTERNAL if `path` is specified in the options.
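
      A minimal sketch of the renamed API, assuming the new overloads mirror the old `createExternalTable` signatures; table names, schema and path are hypothetical:

      ```scala
      // Run in spark-shell, where `spark` is the predefined SparkSession.
      import org.apache.spark.sql.types._

      val schema = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))

      // With a "path" option the table is created as EXTERNAL; without one it is managed.
      spark.catalog.createTable("events_ext", "parquet", schema, Map("path" -> "/data/events"))
      spark.catalog.createTable("events_managed", "parquet", schema, Map.empty[String, String])
      ```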
      
      ## How was this patch tested?
      
      new tests in `CatalogSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16528 from cloud-fan/create-table.
      18ee55dd
    • CodingCat's avatar
      [SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from JobScheduler.jobSets · f8db8945
      CodingCat authored
      ## What changes were proposed in this pull request?
      
      The current implementation of Spark Streaming considers a batch completed regardless of the results of its jobs (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).
      Let's consider the following case:
      A micro-batch contains two jobs that read from two different Kafka topics. One of these jobs fails due to a problem in the user-defined logic, after the other one has finished successfully.
      1. The main thread in the Spark Streaming application executes the line mentioned above,
      2. and another thread (the checkpoint writer) makes a checkpoint file immediately after this line is executed.
      3. Then, due to the current error handling mechanism in Spark Streaming, the StreamingContext is closed (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).
      4. The user recovers from the checkpoint file, and because the JobSet containing the failed job was removed (taken as completed) before the checkpoint was constructed, the data being processed by the failed job is never reprocessed.

      This PR fixes it by removing a JobSet from JobScheduler.jobSets only when all jobs in the JobSet have finished successfully.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: CodingCat <zhunansjtu@gmail.com>
      Author: Nan Zhu <zhunansjtu@gmail.com>
      
      Closes #16542 from CodingCat/SPARK-18905.
      f8db8945
    • Felix Cheung's avatar
      [SPARK-18828][SPARKR] Refactor scripts for R · c84f7d3e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Refactored the scripts to remove duplication and give each script a clearer purpose.
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16249 from felixcheung/rscripts.
      c84f7d3e
    • Felix Cheung's avatar
      [SPARK-19232][SPARKR] Update Spark distribution download cache location on Windows · a115a543
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Windows seems to be the only platform with appauthor in the path, for which we should say "Apache" (and it is case-sensitive).
      The current path of `AppData\Local\spark\spark\Cache` is a bit odd.
      
      ## How was this patch tested?
      
      manual.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16590 from felixcheung/rcachedir.
      a115a543
    • wm624@hotmail.com's avatar
      [SPARK-19066][SPARKR] SparkR LDA doesn't set optimizer correctly · 12c8c216
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for the "em" optimizer, the `isDistributed` field is FALSE, while it should be TRUE according to the Scala code.
      
      In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
      
      ## How was this patch tested?
      Manual tests by comparing with scala example.
      Modified the current unit tests: fixed the incorrect unit test and added the necessary tests for the `summary` method.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16464 from wangmiao1981/new.
      12c8c216
    • jiangxingbo's avatar
      [SPARK-18801][SQL][FOLLOWUP] Alias the view with its child · e635cbb6
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      This PR is a follow-up to address the comments https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299.
      
      We try to wrap the child by:
      1. Generate the `queryOutput` by:
          1.1. If the query column names are defined, map the column names to attributes in the child output by name;
          1.2. Else set the child output attributes to `queryOutput`.
      2. Map the `queryOutput` to the view output by index; if the corresponding attributes don't match, try to up-cast and alias the attribute in `queryOutput` to the attribute in the view output.
      3. Add a Project over the child, with the new output generated by the previous steps.
      If the view output doesn't have the same number of columns as either the child output or the query column names, throw an AnalysisException.
      
      ## How was this patch tested?
      
      Add new test cases in `SQLViewSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16561 from jiangxb1987/alias-view.
      e635cbb6
    • Liang-Chi Hsieh's avatar
      [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet · 61e48f52
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and can't work for Parquet:
      
      1. We only ignore corrupt files in `FileScanRDD`. Actually, we begin to read those files as early as inferring the data schema from them. For corrupt files, we can't read the schema and the program fails. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
      2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, the files may be read before that. In this case, the `ignoreCorruptFiles` config doesn't work either.
      
      This patch targets the Parquet datasource. If this direction is OK, we can address the same issue for other datasources like ORC.
      
      Two main changes in this patch:
      
      1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in multi-threaded manner
      
          We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`. So this patch implements the logic to do the similar thing in `readParquetFootersInParallel`.
      
      2. In `FileScanRDD`, we need to ignore corrupt file too when we call `readFunction` to return iterator.
      
      One thing to notice is:
      
      We read the schema from the Parquet file's footer. The method that reads the footer, `ParquetFileReader.readFooter`, throws `RuntimeException` instead of `IOException` if it can't successfully read the footer. Please check out https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470. So this patch catches `RuntimeException`. One concern is that it might also shadow runtime exceptions other than those caused by reading corrupt files.
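
      A hedged usage sketch of the config this patch makes effective for Parquet; the directory is hypothetical:

      ```scala
      // Run in spark-shell, where `spark` is the predefined SparkSession.
      spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

      // With the flag on, corrupt Parquet files are skipped during both schema inference
      // and reading, instead of failing the whole job.
      val df = spark.read.parquet("/data/mixed_good_and_corrupt_parquet")
      df.count()
      ```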
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.
      61e48f52
  4. Jan 15, 2017
    • gatorsmile's avatar
      [SPARK-19120] Refresh Metadata Cache After Loading Hive Tables · de62ddf7
      gatorsmile authored
      ### What changes were proposed in this pull request?
      ```Scala
              sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
      
              // This table fetch is to fill the cache with zero leaf files
              spark.table("tab").show()
      
              sql(
                s"""
                   |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
                   |INTO TABLE tab
                 """.stripMargin)
      
              spark.table("tab").show()
      ```
      
      In the above example, the returned result is empty after the table is loaded. The metadata cache can be out of date after loading new data into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, for Hive serde tables, only the `parquet` and `orc` formats face this issue, because Hive serde tables in parquet/orc format can be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on.
      
      This PR is to refresh the metadata cache after processing the `LOAD DATA` command.
      
      In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables in the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writing partitioned parquet/orc tables still uses `InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading the out-of-date cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note that it does not need to refresh the cache for non-partitioned parquet/orc tables, because they do not go through `InsertIntoHiveTable` at all. Based on the comments, this PR keeps the existing logic unchanged. That means we always refresh the table, whether or not it is partitioned.
      
      ### How was this patch tested?
      Added test cases in parquetSuites.scala
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
      de62ddf7
    • uncleGen's avatar
      [SPARK-19206][DOC][DSTREAM] Fix outdated parameter descriptions in kafka010 · a5e651f4
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Fix outdated parameter descriptions in kafka010
      
      ## How was this patch tested?
      
      cc koeninger  zsxwing
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16569 from uncleGen/SPARK-19206.
      a5e651f4
    • Shixiong Zhu's avatar
      [SPARK-18971][CORE] Upgrade Netty to 4.0.43.Final · a8567e34
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16568 from zsxwing/SPARK-18971.
      a8567e34
    • Maurus Cuelenaere's avatar
      [MINOR][DOC] Document local[*,F] master modes · 3df2d931
      Maurus Cuelenaere authored
      ## What changes were proposed in this pull request?
      
      core/src/main/scala/org/apache/spark/SparkContext.scala contains the LOCAL_N_FAILURES_REGEX master mode (the `local[*,F]` form), but this was never documented, so this patch documents it.
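
      A short sketch of the master string being documented; the failure count of 4 and the app name are arbitrary:

      ```scala
      import org.apache.spark.sql.SparkSession

      // "local[*,4]" runs locally with as many worker threads as logical cores and
      // sets the per-task maxFailures to 4 (the F in local[*,F]).
      val spark = SparkSession.builder()
        .master("local[*,4]")
        .appName("local-with-failures")
        .getOrCreate()
      ```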
      
      ## How was this patch tested?
      
      By using the Github Markdown preview feature.
      
      Author: Maurus Cuelenaere <mcuelenaere@gmail.com>
      
      Closes #16562 from mcuelenaere/patch-1.
      3df2d931
    • xiaojian.fxj's avatar
      [SPARK-19042] spark executor can't download the jars when uber jar's http url... · c9d612f8
      xiaojian.fxj authored
      [SPARK-19042] spark executor can't download the jars when uber jar's http url contains any query strings
      
      If the uber jar's HTTP URL contains any query strings, the Executor.updateDependencies method can't download the jars correctly. This is because `localName = name.split("/").last` won't produce the expected local jar name. The bug fix is the same as in [SPARK-17855].
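
      A small sketch of the failure mode; the URL is hypothetical, and taking the last segment of the URI path is one way to get the expected name:

      ```scala
      // Hypothetical uber-jar URL with a query string.
      val name = "http://repo.example.com/jars/app-assembly.jar?token=abc123"

      // The old logic keeps the query string in the local file name:
      val localNameOld = name.split("/").last                            // "app-assembly.jar?token=abc123"

      // Splitting the URI path instead drops the query string, as intended:
      val localNameNew = new java.net.URI(name).getPath.split("/").last  // "app-assembly.jar"
      ```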
      
      Author: xiaojian.fxj <xiaojian.fxj@alibaba-inc.com>
      
      Closes #16509 from hustfxj/bug.
      c9d612f8
    • Tsuyoshi Ozawa's avatar
      [SPARK-19207][SQL] LocalSparkSession should use Slf4JLoggerFactory.INSTANCE · 9112f31b
      Tsuyoshi Ozawa authored
      ## What changes were proposed in this pull request?
      
      Use Slf4JLoggerFactory.INSTANCE instead of constructing a Slf4JLoggerFactory object directly; the constructor is deprecated.
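
      A minimal sketch of the change, assuming it is Netty's logging factory that LocalSparkSession configures:

      ```scala
      import io.netty.util.internal.logging.{InternalLoggerFactory, Slf4JLoggerFactory}

      // Deprecated: constructing the factory directly.
      // InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())

      // Preferred: use the shared singleton instance.
      InternalLoggerFactory.setDefaultFactory(Slf4JLoggerFactory.INSTANCE)
      ```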
      
      ## How was this patch tested?
      
      By running StateStoreRDDSuite.
      
      Author: Tsuyoshi Ozawa <ozawa@apache.org>
      
      Closes #16570 from oza/SPARK-19207.
      9112f31b
  5. Jan 14, 2017
    • windpiger's avatar
      [SPARK-19151][SQL] DataFrameWriter.saveAsTable support hive overwrite · 89423539
      windpiger authored
      ## What changes were proposed in this pull request?
      
      After [SPARK-19107](https://issues.apache.org/jira/browse/SPARK-19107), we can now treat Hive as a data source and create Hive tables with DataFrameWriter and Catalog. However, the support is not complete; there are still some cases we do not support.
      
      This PR makes DataFrameWriter.saveAsTable work with the Hive format in overwrite mode.
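
      A hedged usage sketch; the table, column names and data are hypothetical:

      ```scala
      // Run in spark-shell with Hive support, where `spark` is the predefined SparkSession.
      import spark.implicits._

      val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

      // A Hive-format table written through DataFrameWriter, now also in overwrite mode.
      df.write.format("hive").mode("overwrite").saveAsTable("hive_tbl")
      ```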
      
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16549 from windpiger/saveAsTableWithHiveOverwrite.
      89423539
    • hyukjinkwon's avatar
      [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor... · b6a7aa4f
      hyukjinkwon authored
      [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor tests for Hadoop libraries to call native codes properly
      
      ## What changes were proposed in this pull request?
      
      It seems the Hadoop libraries need the winutils binaries on the path in order to call native code.

      It is not a problem in tests for now, because we are only testing SparkR on Windows via AppVeyor, but it can be a problem if we run Scala tests via AppVeyor, as below:
      
      ```
       - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds)
         org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
         ...
      ```
      
      This PR proposes to add it to the `Path` for AppVeyor tests.
      
      ## How was this patch tested?
      
      Manually via AppVeyor.
      
      **Before**
      https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m
      
      **After**
      https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16584 from HyukjinKwon/set-path-appveyor.
      b6a7aa4f
  6. Jan 13, 2017
    • Yucai Yu's avatar
      [SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn · ad0dadaa
      Yucai Yu authored
      ## What changes were proposed in this pull request?
      
      The offset of a short is 4 in OffHeapColumnVector's putShorts, but it should actually be 2.
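
      A small illustration of the arithmetic; the helper is hypothetical and not the actual OffHeapColumnVector code:

      ```scala
      // A JVM short is 2 bytes, so consecutive shorts in off-heap memory are 2 bytes apart;
      // stepping by 4 would scatter the values across the wrong slots.
      def shortAddress(baseAddress: Long, rowId: Int): Long = baseAddress + 2L * rowId

      assert(java.lang.Short.BYTES == 2)
      ```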
      
      ## How was this patch tested?
      
      unit test
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #16555 from yucai/offheap_short.
      ad0dadaa
    • Felix Cheung's avatar
      [SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter · b0e8eb6d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      To allow specifying number of partitions when the DataFrame is created
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16512 from felixcheung/rnumpart.
      b0e8eb6d
    • Vinayak's avatar
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a... · 285a7798
      Vinayak authored
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
      
      The change is for SQLContext to reuse the active SparkSession during construction if the SparkContext supplied is the same as the currently active SparkContext. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object, even though the SparkContext supplied to the new SQLContext is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      285a7798
    • Andrew Ash's avatar
      Fix missing close-parens for In filter's toString · b040cef2
      Andrew Ash authored
      Otherwise the open parenthesis isn't closed in query plan descriptions of batch scans.
      
          PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #16558 from ash211/patch-9.
      b040cef2
    • Wenchen Fan's avatar
      [SPARK-19178][SQL] convert string of large numbers to int should return null · 6b34e745
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we convert a string to an integral type, we first convert the string to `decimal(20, 0)` so that we can turn a string in decimal format into a truncated integral, e.g. `CAST('1.2' AS int)` returns `1`.

      However, this causes problems when we convert a string with a large number to an integral type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null as we expect.

      This is a long-standing bug (it seems to have been there since the first day Spark SQL was created); this PR fixes it by adding native support for converting `UTF8String` to an integral type.
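
      A short sketch of the before/after behavior described above; the output shown is approximate:

      ```scala
      // Run in spark-shell, where `spark` is the predefined SparkSession.
      // Decimal-formatted strings are still truncated, but overflowing values now yield null.
      spark.sql("SELECT CAST('1.2' AS INT) AS a, CAST('1234567890123' AS INT) AS b").show()
      // +---+----+
      // |  a|   b|
      // +---+----+
      // |  1|null|
      // +---+----+
      ```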
      
      ## How was this patch tested?
      
      new regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16550 from cloud-fan/string-to-int.
      6b34e745
    • wm624@hotmail.com's avatar
      [SPARK-19142][SPARKR] spark.kmeans should take seed, initSteps, and tol as parameters · 7f24a0b6
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      spark.kmeans doesn't have an interface to set initSteps, seed and tol. As Spark's k-means algorithm doesn't take the same set of parameters as R's kmeans, we should maintain a different interface in spark.kmeans.
      
      Add missing parameters and corresponding document.
      
      Modified existing unit tests to take additional parameters.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16523 from wangmiao1981/kmeans.
      7f24a0b6