  1. Aug 30, 2017
    • [MINOR][SQL][TEST] Fix shuffle hash join test that was not testing what it expected · 235d2833
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      igore("shuffle hash join") is to shuffle hash join to test _case class ShuffledHashJoinExec_.
      But when you 'ignore' -> 'test', the test is _case class BroadcastHashJoinExec_.
      
      Before modified,  as a result of:canBroadcast is true.
      Print information in _canBroadcast(plan: LogicalPlan)_
      ```
      canBroadcast plan.stats.sizeInBytes:6710880
      canBroadcast conf.autoBroadcastJoinThreshold:10000000
      ```
      
      After the change, `plan.stats.sizeInBytes` is 11184808.
      Debug output from _canBuildLocalHashMap(plan: LogicalPlan)_
      and _muchSmaller(a: LogicalPlan, b: LogicalPlan)_:
      
      ```
      canBuildLocalHashMap plan.stats.sizeInBytes:11184808
      canBuildLocalHashMap conf.autoBroadcastJoinThreshold:10000000
      canBuildLocalHashMap conf.numShufflePartitions:2
      ```
      ```
      muchSmaller a.stats.sizeInBytes * 3:33554424
      muchSmaller b.stats.sizeInBytes:33554432
      ```
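      For context, the planner predicates being probed look roughly like this (a paraphrase of Spark's join-selection checks, with the statistics and conf values passed in explicitly; not the actual source):
      
      ```scala
      // Paraphrased join-selection checks; all sizes are in bytes.
      def canBroadcast(sizeInBytes: BigInt, autoBroadcastJoinThreshold: Long): Boolean =
        sizeInBytes >= 0 && sizeInBytes <= autoBroadcastJoinThreshold
      
      def canBuildLocalHashMap(sizeInBytes: BigInt,
                               autoBroadcastJoinThreshold: Long,
                               numShufflePartitions: Int): Boolean =
        sizeInBytes < BigInt(autoBroadcastJoinThreshold) * numShufflePartitions
      
      def muchSmaller(aSizeInBytes: BigInt, bSizeInBytes: BigInt): Boolean =
        aSizeInBytes * 3 <= bSizeInBytes
      ```
      
      With the bumped statistics, 11184808 > 10000000 rules out the broadcast join, while 11184808 < 10000000 * 2 and 33554424 <= 33554432 keep the shuffled hash join eligible.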
      ## How was this patch tested?
      
      existing test case.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19069 from heary-cao/shuffle_hash_join.
    • gatorsmile · 32d6d9d7
    • [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher · 4133c1b0
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR adds ML examples for the FeatureHasher transform in Scala, Java, and Python.
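      For a flavor of the API, a minimal sketch with made-up data (not necessarily the exact examples the PR adds):
      
      ```scala
      import org.apache.spark.ml.feature.FeatureHasher
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("feature-hasher-sketch").getOrCreate()
      import spark.implicits._
      
      // Toy input mixing numeric, boolean, and string columns.
      val df = Seq((2.2, true, "1", "foo"), (3.3, false, "2", "bar"))
        .toDF("real", "bool", "stringNum", "string")
      
      // Hash all four columns into a single feature vector column.
      val hasher = new FeatureHasher()
        .setInputCols("real", "bool", "stringNum", "string")
        .setOutputCol("features")
      
      hasher.transform(df).show(truncate = false)
      ```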
      
      ## How was this patch tested?
      
      Manually ran examples and verified that output is consistent for different APIs
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.
    • [SPARK-21764][TESTS] Fix test failures on Windows: resources not being closed and incorrect paths · b30a11a6
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      `org.apache.spark.deploy.RPackageUtilsSuite`
      
      ```
       - jars without manifest return false *** FAILED *** (109 milliseconds)
         java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1500266936418-0\dep1-c.jar
      ```
      
      `org.apache.spark.deploy.SparkSubmitSuite`
      
      ```
       - download one file to local *** FAILED *** (16 milliseconds)
         java.net.URISyntaxException: Illegal character in authority at index 6: s3a://C:\projects\spark\target\tmp\test2630198944759847458.jar
      
       - download list of files to local *** FAILED *** (0 milliseconds)
         java.net.URISyntaxException: Illegal character in authority at index 6: s3a://C:\projects\spark\target\tmp\test2783551769392880031.jar
      ```
      
      `org.apache.spark.scheduler.ReplayListenerSuite`
      
      ```
       - Replay compressed inprogress log file succeeding on partial read (156 milliseconds)
         Exception encountered when attempting to run a suite with class name:
         org.apache.spark.scheduler.ReplayListenerSuite *** ABORTED *** (1 second, 391 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-8f3cacd6-faad-4121-b901-ba1bba8025a0
      
       - End-to-end replay *** FAILED *** (62 milliseconds)
         java.io.IOException: No FileSystem for scheme: C
      
       - End-to-end replay with compression *** FAILED *** (110 milliseconds)
         java.io.IOException: No FileSystem for scheme: C
      ```
      
      `org.apache.spark.sql.hive.StatisticsSuite`
      
      ```
       - SPARK-21079 - analyze table with location different than that of individual partitions *** FAILED *** (875 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - SPARK-21079 - analyze partitioned table with only a subset of partitions visible *** FAILED *** (47 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
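      For illustration, the core of the "incorrect paths" problem (a hedged sketch, not the PR's code): embedding a Windows absolute path in a URI string yields an illegal authority, while converting through `java.io.File` produces a well-formed `file:` URI:
      
      ```scala
      import java.io.File
      
      val winPath = "C:\\projects\\spark\\target\\tmp\\test.jar" // hypothetical path
      
      // Naively prefixing a scheme produces an invalid URI on Windows:
      //   new java.net.URI("s3a://" + winPath) // URISyntaxException: illegal character in authority
      
      // Converting via File first yields a proper URI such as
      // file:/C:/projects/spark/target/tmp/test.jar
      println(new File(winPath).toURI)
      ```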
      
      **Note:** this PR does not fix:
      
      `org.apache.spark.deploy.SparkSubmitSuite`
      
      ```
       - launch simple application with spark-submit with redaction *** FAILED *** (172 milliseconds)
         java.util.NoSuchElementException: next on empty iterator
      ```
      
      I can't reproduce this on my Windows machine, but it appears to fail consistently on AppVeyor. This one is unclear to me and hard to debug, so I did not include it for now.
      
      **Note:** there look to be more instances, but they are hard to identify, partly due to flakiness and partly due to noisy logs and errors. I will probably make one more pass if that is fine.
      
      ## How was this patch tested?
      
      Manually via AppVeyor:
      
      **Before**
      
      - `org.apache.spark.deploy.RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/8t8ra3lrljuir7q4
      - `org.apache.spark.deploy.SparkSubmitSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/taquy84yudjjen64
      - `org.apache.spark.scheduler.ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/24omrfn2k0xfa9xq
      - `org.apache.spark.sql.hive.StatisticsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/2079y1plgj76dc9l
      
      **After**
      
      - `org.apache.spark.deploy.RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/3803dbfn89ne1164
      - `org.apache.spark.deploy.SparkSubmitSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/m5l350dp7u9a4xjr
      - `org.apache.spark.scheduler.ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/565vf74pp6bfdk18
      - `org.apache.spark.sql.hive.StatisticsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/qm78tsk8c37jb6s4
      
      Jenkins tests are required and AppVeyor tests will be triggered.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18971 from HyukjinKwon/windows-fixes.
    • [SPARK-21806][MLLIB] BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading · 734ed7a7
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Prepend (0, p) to the precision-recall curve instead of (0, 1), where p is the precision at the lowest recall point.
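      A small usage sketch (data illustrative) showing where the change is visible:
      
      ```scala
      import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("pr-curve-sketch").getOrCreate()
      
      // (score, label) pairs; the values here are made up.
      val scoreAndLabels = spark.sparkContext.parallelize(
        Seq((0.9, 1.0), (0.6, 0.0), (0.4, 1.0)))
      
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      // pr() emits (recall, precision) points; with this change the first point is
      // (0.0, p), where p is the precision at the lowest-recall threshold, rather
      // than an unconditional (0.0, 1.0).
      metrics.pr().collect().foreach(println)
      ```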
      
      ## How was this patch tested?
      
      Updated tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19038 from srowen/SPARK-21806.
    • [SPARK-21873][SS] - Avoid using `return` inside `CachedKafkaConsumer.get` · 8f0df6bc
      Yuval Itzchakov authored
      During profiling of a structured streaming application with Kafka as the source, I came across this exception:
      
      ![Structured Streaming Kafka Exceptions](https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png)
      
      This is a one-minute sample, during which 106K `NonLocalReturnControl` exceptions were thrown.
      This happens because `CachedKafkaConsumer.get` is run inside:
      
      `private def runUninterruptiblyIfPossible[T](body: => T): T`
      
      Where `body: => T` is the `get` method. Because the body becomes a closure, the only way for the `return` inside `get`'s `while` loop to escape is for the runtime to throw (and later catch) the exception above.
      
      ## What changes were proposed in this pull request?
      
      Instead of using `return` (which is generally not recommended in Scala), we place the result of the `fetchData` method inside a local variable and use a boolean flag to indicate the status of fetching data, which we monitor as our predicate to the `while` loop.
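      A hedged sketch of the before/after shapes (names hypothetical, not the actual consumer code):
      
      ```scala
      // Stand-in for runUninterruptiblyIfPossible: takes the body by name.
      def runUninterruptibly[T](body: => T): T = body
      
      // Before: `return` inside the by-name body compiles to a thrown
      // scala.runtime.NonLocalReturnControl, which is expensive on a hot path.
      def getBefore(fetch: () => Option[Int]): Int = runUninterruptibly {
        while (true) {
          fetch() match {
            case Some(v) => return v // non-local return: implemented via an exception
            case None    => // retry
          }
        }
        -1 // unreachable; satisfies the type checker
      }
      
      // After: store the result in a local and drive the loop with a boolean flag.
      def getAfter(fetch: () => Option[Int]): Int = runUninterruptibly {
        var result = -1
        var fetchedData = false
        while (!fetchedData) {
          fetch() match {
            case Some(v) => result = v; fetchedData = true
            case None    => // keep trying
          }
        }
        result
      }
      ```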
      
      ## How was this patch tested?
      
      I've run the `KafkaSourceSuite` to make sure there are no regressions. Since the exception isn't visible from user code, there is no way (at least that I could think of) to add this as a test to the existing suite.
      
      Author: Yuval Itzchakov <yuval.itzchakov@clicktale.com>
      
      Closes #19059 from YuvalItzchakov/master.
    • [MINOR][TEST] Fix off-heap memory leaks in unit tests · d4895c9d
      liuxian authored
      ## What changes were proposed in this pull request?
      Free off-heap memory.
      I have checked all the unit tests.
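      The pattern being enforced is roughly the following (a sketch; `MemoryAllocator` is Spark's off-heap allocator, the surrounding shape is illustrative):
      
      ```scala
      import org.apache.spark.unsafe.memory.MemoryAllocator
      
      // Allocate an off-heap block for the test...
      val block = MemoryAllocator.UNSAFE.allocate(1024)
      try {
        // ... exercise the code under test that uses the block ...
      } finally {
        // ... and always free it, or the test leaks off-heap memory.
        MemoryAllocator.UNSAFE.free(block)
      }
      ```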
      
      ## How was this patch tested?
      N/A
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #19075 from 10110346/memleak.
  2. Aug 29, 2017
    • [SPARK-20886][CORE] HadoopMapReduceCommitProtocol to handle FileOutputCommitter.getWorkPath==null · e47f48c7
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Handles the situation where a `FileOutputCommitter.getWorkPath()` returns `null` by downgrading to the supplied `path` argument.
      
      The existing code does an `Option(workPath.toString).getOrElse(path)`, which triggers an NPE in the `toString()` call when `workPath == null`. The code was apparently meant to handle this (hence the `getOrElse()` clause), but since the NPE has already occurred by that point, the else clause never gets invoked.
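      A minimal sketch of the bug and the described fix (method shape illustrative):
      
      ```scala
      import org.apache.hadoop.fs.Path
      
      // Buggy: toString is called on workPath before Option can see the null.
      def committerOutputPath(workPath: Path, path: String): String =
        Option(workPath.toString).getOrElse(path) // NPE when workPath == null
      
      // Fixed: lift the possibly-null value first, then map over it.
      def committerOutputPathFixed(workPath: Path, path: String): String =
        Option(workPath).map(_.toString).getOrElse(path)
      ```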
      
      ## How was this patch tested?
      
      Manually, with some later code review.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #18111 from steveloughran/cloud/SPARK-20886-committer-NPE.
    • [SPARK-21845][SQL] Make codegen fallback of expressions configurable · 3d0e1742
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We should make the codegen fallback of expressions configurable. So far, it is always on, which can hide compilation bugs in our codegen. Thus, we should also disable the codegen fallback when running test cases.
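      A hedged sketch of how a test might pin this down, assuming the configuration key is `spark.sql.codegen.fallback`:
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("codegen-fallback-sketch").getOrCreate()
      // Disable the fallback so codegen compilation bugs fail loudly instead of
      // being silently masked by interpreted evaluation.
      spark.conf.set("spark.sql.codegen.fallback", "false")
      ```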
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19062 from gatorsmile/fallbackCodegen.
    • [SPARK-21813][CORE] Modify TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES comments · fba9cc84
      he.qiao authored
      ## What changes were proposed in this pull request?
      The variable "TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES" comment error, It shouldn't be 2^32-1, should be 2^31-1, That means the maximum value of int.
      
      ## How was this patch tested?
      Existing test cases
      
      Author: he.qiao <he.qiao17@zte.com.cn>
      
      Closes #19025 from Geek-He/08_23_comments.
    • [SPARK-21728][CORE] Allow SparkSubmit to use Logging. · d7b1fcf8
      Marcelo Vanzin authored
      This change initializes logging when SparkSubmit runs, using
      a configuration that should avoid printing log messages as
      much as possible with most configurations, and adds code to
      restore the Spark logging system to as close as possible to
      its initial state, so the Spark app being run can re-initialize
      logging with its own configuration.
      
      With that feature, some duplicate code in SparkSubmit can now
      be replaced with the existing methods in the Utils class, which
      could not be used before because they initialized logging. As part
      of that I also did some minor refactoring, moving methods that
      should really belong in DependencyUtils.
      
      The change also shuffles some code in SparkHadoopUtil so that
      SparkSubmit can create a Hadoop config like the rest of Spark
      code, respecting the user's Spark configuration.
      
      The behavior was verified running spark-shell, pyspark and
      normal applications, then verifying the logging behavior,
      with and without dependency downloads.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #19013 from vanzin/SPARK-21728.
    • [MINOR][ML] Document treatment of instance weights in logreg summary · 840ba053
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Add Scaladoc noting that instance weights are currently ignored in the logistic regression summary traits.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #19071 from jkbradley/lr-summary-minor.
    • [SPARK-21801][SPARKR][TEST] unit test randomly fail with randomforest · 6077e3ef
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      fix the random seed to eliminate variability
      
      ## How was this patch tested?
      
      jenkins, appveyor, lots more jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #19018 from felixcheung/rrftest.
    • [SPARK-21255][SQL] simplify encoder for java enum · 6327ea57
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up for https://github.com/apache/spark/pull/18488, to simplify the code.
      
      The major change is, we should map java enum to string type, instead of a struct type with a single string field.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19066 from cloud-fan/fix.
    • [SPARK-21848][SQL] Add trait UserDefinedExpression to identify user-defined functions · 8fcbda9c
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Add trait UserDefinedExpression to identify user-defined functions.
      UDFs can be expensive. In the optimizer we may need to avoid executing a UDF multiple times.
      E.g.
      ```scala
      table.select(UDF as 'a).select('a, ('a + 1) as 'b)
      ```
      If the UDF is expensive in this case, the optimizer should not collapse the projects into
      ```scala
      table.select(UDF as 'a, (UDF+1) as 'b)
      ```
      
      Currently, UDF classes like `PythonUDF` and `HiveGenericUDF` are not defined in Catalyst.
      This PR adds a new trait to make it easier to identify user-defined functions, as sketched below.
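      A minimal sketch of the idea (the trait and traversal are illustrative, not Catalyst's actual classes):
      
      ```scala
      // Marker trait: anything mixing this in is user-defined and potentially expensive.
      trait Expression { def children: Seq[Expression] }
      trait UserDefinedExpression extends Expression
      
      // An optimizer rule can cheaply test for UDFs before duplicating expressions,
      // e.g. before collapsing two adjacent Projects into one.
      def referencesUdf(e: Expression): Boolean =
        e.isInstanceOf[UserDefinedExpression] || e.children.exists(referencesUdf)
      ```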
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #19064 from gengliangwang/UDFType.
    • [SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type. · 32fa0b81
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      As mentioned at https://github.com/apache/spark/pull/18680#issuecomment-316820409, when we have more `ColumnVector` implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches.
      
      As for the read path, one of the major paths is the code generated by `ColumnBatchScan`. Currently it refers to `ColumnVector`, so the penalty grows as we add more classes; but we can know the concrete type from the usage site, e.g. the vectorized Parquet reader uses `OnHeapColumnVector`. We can use the concrete type in the generated code directly to avoid the penalty.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18989 from ueshin/issues/SPARK-21781.
  3. Aug 28, 2017
    • [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression · c7270a46
      Weichen Xu authored
      ## What changes were proposed in this pull request?
      
      Add 4 traits, using the following hierarchy:
      
      - `LogisticRegressionSummary`
      - `LogisticRegressionTrainingSummary` extends `LogisticRegressionSummary`
      - `BinaryLogisticRegressionSummary` extends `LogisticRegressionSummary`
      - `BinaryLogisticRegressionTrainingSummary` extends `LogisticRegressionTrainingSummary` and `BinaryLogisticRegressionSummary`
      
      Public methods such as `def summary` return only the trait types listed above.
      
      Then implement 4 concrete classes:
      
      - `LogisticRegressionSummaryImpl` (multiclass case)
      - `LogisticRegressionTrainingSummaryImpl` (multiclass case)
      - `BinaryLogisticRegressionSummaryImpl` (binary case)
      - `BinaryLogisticRegressionTrainingSummaryImpl` (binary case)
      
      ## How was this patch tested?
      
      Existing tests & added tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15435 from WeichenXu123/mlor_summary.
    • [SPARK-19662][SCHEDULER][TEST] Add Fair Scheduler Unit Test coverage for different build cases · 73e64f7d
      erenavsarogullari authored
      ## What changes were proposed in this pull request?
      Fair Scheduler can be built via one of the following options:
      - By setting a `spark.scheduler.allocation.file` property,
      - By setting `fairscheduler.xml` into classpath.
      
      These options are checked **in order**, and the fair scheduler is built from the first option found. If an invalid path is given, a `FileNotFoundException` is expected.
      
      This PR aims at unit test coverage of these use cases; a minor documentation change has also been added for the second option (`fairscheduler.xml` on the classpath) to inform users. A configuration sketch follows below.
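      A configuration sketch for the two options (paths hypothetical):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Option 1: point at an explicit allocation file.
      val conf = new SparkConf()
        .set("spark.scheduler.mode", "FAIR")
        .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      
      // Option 2: set no file property and instead place a fairscheduler.xml
      // on the application classpath; it is picked up as the fallback.
      ```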
      
      Also, this PR is related to #16813 and has been created separately to keep the patch isolated and to help reviewers.
      
      ## How was this patch tested?
      Added new Unit Tests.
      
      Author: erenavsarogullari <erenavsarogullari@gmail.com>
      
      Closes #16992 from erenavsarogullari/SPARK-19662.
    • [SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config for... · 24e6c187
      pgandhi authored
      [SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server
      
      History Server launch uses `SparkClassCommandBuilder` for launching the server. `SPARK_CLASSPATH` has been removed and deprecated. For spark-submit this takes a different route, and `spark.driver.extraClassPath` takes care of specifying additional jars on the classpath that were previously specified via `SPARK_CLASSPATH`. Right now the only way to specify additional jars for launching daemons such as the History Server is `SPARK_DIST_CLASSPATH` (https://spark.apache.org/docs/latest/hadoop-provided.html), but that is presumably a distribution classpath. It would be nice to have a config like `spark.driver.extraClassPath` for launching daemons such as the History Server.
      
      Added a new environment variable, SPARK_DAEMON_CLASSPATH, to set the classpath for launching daemons. Tested and verified for the History Server and standalone mode.
      
      ## How was this patch tested?
      Initially, the History Server start script would fail because it could not find the required jars on the Java classpath; the same was true for running the Master and Worker in standalone mode. After adding the jars via the SPARK_DAEMON_CLASSPATH environment variable, both the History Server and the standalone daemons start up and run.
      
      Author: pgandhi <pgandhi@yahoo-inc.com>
      Author: pgandhi999 <parthkgandhi9@gmail.com>
      
      Closes #19047 from pgandhi999/master.
    • [SPARK-21818][ML][MLLIB] Fix bug where MultivariateOnlineSummarizer.variance generates negative results · 0456b405
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Because of numerical error, `MultivariateOnlineSummarizer.variance` can produce a negative variance.
      
      **This is a serious bug: many algorithms in MLlib use the stddev computed from `sqrt(variance)`, so a negative variance yields NaN and crashes the whole algorithm.**
      
      We can reproduce this bug with the following code:
      ```
          val summarizer1 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.7)
          val summarizer2 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.4)
          val summarizer3 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.5)
          val summarizer4 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.4)
      
          val summarizer = summarizer1
            .merge(summarizer2)
            .merge(summarizer3)
            .merge(summarizer4)
      
          println(summarizer.variance(0))
      ```
      This PR fixes the bug in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and in several places in `WeightedLeastSquares`.
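      The essence of the fix as a hedged sketch (names illustrative; the real code works on the summarizer's running aggregates):
      
      ```scala
      // Weighted population variance from running sums; floating-point cancellation
      // can push the raw value slightly below zero, so clamp it at zero.
      def safeVariance(sumSq: Double, sum: Double, weightSum: Double): Double = {
        val raw = (sumSq - sum * sum / weightSum) / weightSum
        math.max(raw, 0.0) // never let numerical error produce a negative variance
      }
      ```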
      
      ## How was this patch tested?
      
      test cases added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #19029 from WeichenXu123/fix_summarizer_var_bug.
  4. Aug 27, 2017
  5. Aug 25, 2017
    • [MINOR][DOCS] Minor doc fixes related with doc build and uses script dir in SQL doc gen script · 3b66b1c4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes both:
      
      - Adds information about Javadoc, SQL docs, and a few more details to `docs/README.md`, plus a comment in `docs/_plugins/copy_api_dirs.rb` related to Javadoc.
      
      - Adds some commands so that the script always runs the SQL docs build under the `./sql` directory (so that `./sql/create-docs.sh` can be run directly from the root directory).
      
      ## How was this patch tested?
      
      Manual tests with `jekyll build` and `./sql/create-docs.sh` in the root directory.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19019 from HyukjinKwon/minor-doc-build.
    • [SPARK-21831][TEST] Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite · 522e1f80
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      [SPARK-19025](https://github.com/apache/spark/pull/16869) removes SQLBuilder, so we don't need the following in HiveCompatibilitySuite.
      
      ```scala
      // Ensures that the plans generation use metastore relation and not OrcRelation
      // Was done because SqlBuilder does not work with plans having logical relation
      TestHive.setConf(HiveUtils.CONVERT_METASTORE_ORC, false)
      ```
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19043 from dongjoon-hyun/SPARK-21831.
    • [SPARK-21837][SQL][TESTS] UserDefinedTypeSuite Local UDTs not actually testing what it intends · 1a598d71
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Adjust Local UDTs test to assert about results, and fix index of vector column. See JIRA for details.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19053 from srowen/SPARK-21837.
    • [SPARK-21756][SQL] Add JSON option to allow unquoted control characters · 51620e28
      vinodkc authored
      ## What changes were proposed in this pull request?
      
      This patch adds an `allowUnquotedControlChars` option to the JSON data source, allowing JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
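      A usage sketch (the input path is hypothetical):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("unquoted-control-chars").getOrCreate()
      
      // Accept raw control characters (e.g. literal tabs) inside JSON string values.
      val df = spark.read
        .option("allowUnquotedControlChars", "true")
        .json("/path/to/input.json")
      ```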
      
      ## How was this patch tested?
      Add new test cases
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #19008 from vinodkc/br_fix_SPARK-21756.
    • [SPARK-17742][CORE] Fail launcher app handle if child process exits with error. · 628bdeab
      Marcelo Vanzin authored
      This is a follow-up to cba826d0; that commit set the app handle state
      to "LOST" when the child process exited, but that can be ambiguous. This
      change sets the state to "FAILED" if the exit code was non-zero and
      the handle state wasn't a failure state, or "LOST" if the exit status
      was zero.
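      A sketch of the resulting decision (types and names illustrative, not the launcher's actual API):
      
      ```scala
      sealed trait HandleState { def isFailure: Boolean }
      case object Failed extends HandleState { val isFailure = true }
      case object Lost extends HandleState { val isFailure = false }
      
      // Child process exited: pick the final state from its exit code.
      def finalState(exitCode: Int, current: HandleState): HandleState =
        if (exitCode != 0 && !current.isFailure) Failed // error exit, not already failed
        else if (exitCode == 0) Lost                    // clean exit with no final app state
        else current                                    // keep the existing failure state
      ```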
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #19012 from vanzin/SPARK-17742.
    • [SPARK-21714][CORE][YARN] Avoiding re-uploading remote resources in yarn client mode · 1813c4a8
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      With SPARK-10643, Spark supports downloading resources from remote locations in client deploy mode. But the implementation overrides the variables representing added resources (like `args.jars`, `args.pyFiles`) with local paths, and the YARN client then uses those local paths to re-upload the resources to the distributed cache. This unnecessarily breaks the semantics of putting resources on a shared FS, so this PR proposes to fix it.
      
      ## How was this patch tested?
      
      This is manually verified with jars, pyFiles in local and remote storage, both in client and cluster mode.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18962 from jerryshao/SPARK-21714.
    • [SPARK-21832][TEST] Merge SQLBuilderTest into ExpressionSQLBuilderSuite · 1f24ceee
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After [SPARK-19025](https://github.com/apache/spark/pull/16869), there is no need to keep SQLBuilderTest.
      ExpressionSQLBuilderSuite is the only place to use it.
      This PR aims to remove SQLBuilderTest.
      
      ## How was this patch tested?
      
      Pass the updated `ExpressionSQLBuilderSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19044 from dongjoon-hyun/SPARK-21832.
    • [MINOR][BUILD] Fix build warnings and Java lint errors · de7af295
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19051 from srowen/JavaWarnings.
    • [SPARK-21527][CORE] Use buffer limit in order to use JAVA NIO Util's buffercache · 574ef6c9
      zhoukang authored
      ## What changes were proposed in this pull request?
      
      Right now, `ChunkedByteBuffer#writeFully` does not slice the bytes first. Observe this code in Java NIO's `Util#getTemporaryDirectBuffer`:
      
      ```java
      BufferCache cache = bufferCache.get();
      ByteBuffer buf = cache.get(size);
      if (buf != null) {
          return buf;
      } else {
          // No suitable buffer in the cache so we need to allocate a new
          // one. To avoid the cache growing then we remove the first
          // buffer from the cache and free it.
          if (!cache.isEmpty()) {
              buf = cache.removeFirst();
              free(buf);
          }
          return ByteBuffer.allocateDirect(size);
      }
      ```
      
      If we slice first with a fixed size, we can hit the buffer cache and only need to allocate on the first write call.
      Since we otherwise allocate a new buffer whose free time we cannot control, this once caused a memory issue in our production cluster.
      In this patch, I supply a new API which slices into fixed-size chunks for buffer writing.
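      A hedged sketch of the sliced write (the chunk size parameter and method shape are illustrative):
      
      ```scala
      import java.nio.ByteBuffer
      import java.nio.channels.WritableByteChannel
      
      // Cap each write at a fixed chunk size so NIO's temporary direct-buffer
      // cache can serve a reusable buffer instead of allocating a huge one.
      def writeFully(channel: WritableByteChannel, bytes: ByteBuffer, chunkSize: Int): Unit = {
        val originalLimit = bytes.limit()
        while (bytes.hasRemaining) {
          val ioSize = math.min(bytes.remaining(), chunkSize)
          bytes.limit(bytes.position() + ioSize) // expose at most one chunk per write
          channel.write(bytes)
          bytes.limit(originalLimit) // restore so the next pass sees the rest
        }
      }
      ```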
      
      ## How was this patch tested?
      
      Unit test and test in production.
      
      Author: zhoukang <zhoukang199191@gmail.com>
      Author: zhoukang <zhoukang@xiaomi.com>
      
      Closes #18730 from caneGuy/zhoukang/improve-chunkwrite.
      574ef6c9
    • [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum · 7d16776d
      mike authored
      ## What changes were proposed in this pull request?
      
      Fixed NPE when creating encoder for enum.
      
      When you try to create an encoder for an enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495.
      I did a little research, and it turns out that in JavaTypeInference the following code
      ```
        def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
          val beanInfo = Introspector.getBeanInfo(beanClass)
          beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
            .filter(_.getReadMethod != null)
        }
      ```
      filters out properties named "class", because we wouldn't want to serialize those. But enum types have another property of type Class, named "declaringClass", which we end up inspecting recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to the NPE at TypeToken:495.
      
      I added the property name "declaringClass" to the filter to resolve this.
      
      ## How was this patch tested?
      Unit test in JavaDatasetSuite which creates an encoder for enum
      
      Author: mike <mike0sv@gmail.com>
      Author: Mikhail Sveshnikov <mike0sv@gmail.com>
      
      Closes #18488 from mike0sv/enum-support.
  6. Aug 24, 2017
    • [SPARK-21108][ML] convert LinearSVC to aggregator framework · f3676d63
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Convert LinearSVC to the new aggregator framework.
      
      ## How was this patch tested?
      
      existing unit test.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #18315 from hhbyyh/svcAggregator.
    • [SPARK-21830][SQL] Bump ANTLR version and fix a few issues. · 05af2de0
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
      
      The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
      ```sql
      SELECT *
      FROM RANGE(1000)
      WHERE
      TRUE
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      ```
      
      This is caused by a known bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #19042 from hvanhovell/SPARK-21830.
    • [SPARK-21701][CORE] Enable RPC client to use `SO_RCVBUF` and `SO_SNDBUF` in SparkConf. · 763b83ee
      xu.zhang authored
      ## What changes were proposed in this pull request?
      
      TCP parameters like SO_RCVBUF and SO_SNDBUF can be set in SparkConf, and `org.apache.spark.network.server.TransportServer` can use those parameters to build the server with Netty. But for `TransportClientFactory` there is no way to set those parameters from SparkConf, which can leave the server and client sides configured inconsistently. So this PR enables the RPC client to use those TCP parameters as well.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: xu.zhang <xu.zhang@hulu.com>
      
      Closes #18964 from neoremind/add_client_param.
    • [SPARK-21788][SS] Handle more exceptions when stopping a streaming query · d3abb369
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add more cases we should view as a normal query stop rather than a failure.
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #18997 from zsxwing/SPARK-21788.
      d3abb369
    • [SPARK-21826][SQL] outer broadcast hash join should not throw NPE · 2dd37d82
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 .
      
      The non-equal join condition should only be applied when the equal-join condition matches; for outer joins, the build side of an unmatched row is null.
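      For illustration, a minimal repro shape (hypothetical data, not the PR's regression test): a left outer join whose condition mixes an equality with a non-equality predicate:
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("outer-join-sketch").getOrCreate()
      import spark.implicits._
      
      val left = Seq((1, 10), (2, 20)).toDF("k", "v")
      val right = Seq((1, 100)).toDF("k", "w")
      
      // For unmatched left rows the build side is null, so evaluating `w > 50`
      // on those rows is exactly what the fix guards against.
      val joined = left.join(right, left("k") === right("k") && right("w") > 50, "left_outer")
      joined.show()
      ```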
      
      ## How was this patch tested?
      
      regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19036 from cloud-fan/bug.
    • [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved... · 183d4cb7
      Liang-Chi Hsieh authored
      [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
      
      ## What changes were proposed in this pull request?
      
      With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans.
      
      For a correlated IN query looks like:
      
          SELECT t1.a FROM t1
          WHERE
          t1.a IN (SELECT t2.c
                  FROM t2
                  WHERE t1.b < t2.d);
      
      The query plan might look like:
      
          Project [a#0]
          +- Filter a#0 IN (list#4 [b#1])
             :  +- Project [c#2]
             :     +- Filter (outer(b#1) < d#3)
             :        +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      After `PullupCorrelatedPredicates`, it produces query plan like:
      
          'Project [a#0]
          +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
             :  +- Project [c#2, d#3]
             :     +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery.
      
      When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks whether the number of `value` expressions matches the output size of the subquery. In the above example, there is only one `value` expression while the subquery output has two attributes, `c#2` and `d#3`, so the check fails and `In.resolved` returns `false`.
      
      We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18968 from viirya/SPARK-21759.
    • [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector... · 9e33954d
      Takuya UESHIN authored
      [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.
      
      ## What changes were proposed in this pull request?
      
      This is a refactoring of `ColumnVector` hierarchy and related classes.
      
      1. make `ColumnVector` read-only
      2. introduce `WritableColumnVector` with write interface
      3. remove `ReadOnlyColumnVector`
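      
      A minimal sketch of the resulting split (signatures illustrative, not the exact Spark classes):
      
      ```scala
      // Read-only base: consumers only get accessors.
      abstract class ColumnVector {
        def getInt(rowId: Int): Int
        def isNullAt(rowId: Int): Boolean
      }
      
      // Writers get a separate subtype carrying the mutation API.
      abstract class WritableColumnVector extends ColumnVector {
        def putInt(rowId: Int, value: Int): Unit
        def putNull(rowId: Int): Unit
      }
      ```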
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18958 from ueshin/issues/SPARK-21745.
    • [SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should... · dc5d34d8
      hyukjinkwon authored
      [SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column
      
      ## What changes were proposed in this pull request?
      
      While preparing to take over https://github.com/apache/spark/pull/16537, I realised a (I think) better approach to make the exception handling in one point.
      
      This PR proposes to fix `_to_java_column` in `pyspark.sql.column`, which most functions in `functions.py` and some other APIs use. This `_to_java_column` basically does not work with types other than `pyspark.sql.column.Column` or strings (`str` and `unicode`).
      
      If this is not `Column`, then it calls `_create_column_from_name` which calls `functions.col` within JVM:
      
      https://github.com/apache/spark/blob/42b9eda80e975d970c3e8da4047b318b83dd269f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L76
      
      And it looks like we only have the `String` variant of `col`.
      
      So, these should work:
      
      ```python
      >>> from pyspark.sql.column import _to_java_column, Column
      >>> _to_java_column("a")
      JavaObject id=o28
      >>> _to_java_column(u"a")
      JavaObject id=o29
      >>> _to_java_column(spark.range(1).id)
      JavaObject id=o33
      ```
      
      whereas these do not:
      
      ```python
      >>> _to_java_column(1)
      ```
      ```
      ...
      py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
      py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
          ...
      ```
      
      ```python
      >>> _to_java_column([])
      ```
      ```
      ...
      py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
      py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
          ...
      ```
      
      ```python
      >>> class A(): pass
      >>> _to_java_column(A())
      ```
      ```
      ...
      AttributeError: 'A' object has no attribute '_get_object_id'
      ```
      
      Meaning, most functions using `_to_java_column`, such as `udf` or `to_json` or some other APIs, throw an exception as below:
      
      ```python
      >>> from pyspark.sql.functions import udf
      >>> udf(lambda x: x)(None)
      ```
      
      ```
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
      : java.lang.NullPointerException
          ...
      ```
      
      ```python
      >>> from pyspark.sql.functions import to_json
      >>> to_json(None)
      ```
      
      ```
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
      : java.lang.NullPointerException
          ...
      ```
      
      **After this PR**:
      
      ```python
      >>> from pyspark.sql.functions import udf
      >>> udf(lambda x: x)(None)
      ...
      ```
      
      ```
      TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
      ```
      
      ```python
      >>> from pyspark.sql.functions import to_json
      >>> to_json(None)
      ```
      
      ```
      ...
      TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `python/pyspark/sql/tests.py` and manual tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #19027 from HyukjinKwon/SPARK-19165.
    • [SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one · 95713eb4
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      
      When `json_tuple` extracts values from JSON, it returns null for repeated columns other than the first one, as below:
      
      ``` scala
      scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
      +---+---+----+
      | c0| c1|  c2|
      +---+---+----+
      |  1|  2|null|
      +---+---+----+
      ```
      
      I think this should be consistent with Hive's implementation:
      ```
      hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
      ...
      1    1
      ```
      
      In this PR, we locate all matching indices in `fieldNames` instead of returning only the first matched index (i.e., `indexOf`).
      
      ## How was this patch tested?
      
      Added test in JsonExpressionsSuite.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #19017 from jmchung/SPARK-21804.