  1. Apr 29, 2016
    • [SPARK-15019][SQL] Propagate all Spark Confs to HiveConf created in HiveClientImpl · b33d6b72
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR makes two changes:
      1. We will propagate Spark confs to the HiveConf created in HiveClientImpl, so users can also use Spark confs to set the warehouse location and metastore URL (see the sketch below).
      2. In sql/hive, HiveClientImpl will be the only place where we create a new HiveConf.
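
      A minimal sketch of what this enables (the builder usage and values are illustrative, assuming a Spark 2.0-style `SparkSession`; the two config keys are real Spark/Hive settings):
      ```scala
      import org.apache.spark.sql.SparkSession

      // With this change, Hive-related settings passed as Spark confs reach the
      // HiveConf constructed inside HiveClientImpl.
      val spark = SparkSession.builder()
        .enableHiveSupport()
        .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") // warehouse location
        .config("hive.metastore.uris", "thrift://localhost:9083")  // metastore URL
        .getOrCreate()
      ```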
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12791 from yhuai/onlyUseHiveConfInHiveClientImpl.
    • [SPARK-15003] Use ConcurrentHashMap in place of HashMap for NewAccumulator.originals · dcfaeade
      tedyu authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to use ConcurrentHashMap in place of HashMap for NewAccumulator.originals
      
      This should result in better performance.
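
      A hypothetical sketch of the pattern (types simplified; this is not the actual accumulator code):
      ```scala
      import java.util.concurrent.ConcurrentHashMap

      // A ConcurrentHashMap supports thread-safe reads and writes without a
      // coarse lock around every access, unlike a plain HashMap.
      val originals = new ConcurrentHashMap[Long, AnyRef]()
      originals.putIfAbsent(1L, "accumulator") // atomic insert-if-absent
      assert(originals.containsKey(1L))
      ```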
      
      ## How was this patch tested?
      
      Existing unit test suite
      
      cc cloud-fan
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #12776 from tedyu/master.
    • [SPARK-14858] [SQL] Enable subquery pushdown · 83061be6
      Herman van Hovell authored
      The previous subquery PRs did not include support for pushing down subqueries used in filters (`WHERE`/`HAVING`). This PR adds that support. For example:
      ```scala
      range(0, 10).registerTempTable("a")
      range(5, 15).registerTempTable("b")
      range(7, 25).registerTempTable("c")
      range(3, 12).registerTempTable("d")
      val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)")
      plan.explain(true)
      ```
      Leads to the following Analyzed & Optimized plans:
      ```
      == Parsed Logical Plan ==
      ...
      
      == Analyzed Logical Plan ==
      id: bigint, id: bigint, id: bigint
      Project [id#0L,id#4L,id#8L]
      +- Filter predicate-subquery#16 [(id#0L = id#12L)]
         :  +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)]
         :     +- Project [id#12L]
         :        +- SubqueryAlias d
         :           +- Range 3, 12, 1, 8, [id#12L]
         +- Join LeftOuter, Some((id#8L = id#4L))
            :- Join Inner, Some((id#0L = id#4L))
            :  :- SubqueryAlias a
            :  :  +- Range 0, 10, 1, 8, [id#0L]
            :  +- SubqueryAlias b
            :     +- Range 5, 15, 1, 8, [id#4L]
            +- SubqueryAlias c
               +- Range 7, 25, 1, 8, [id#8L]
      
      == Optimized Logical Plan ==
      Join LeftOuter, Some((id#8L = id#4L))
      :- Join Inner, Some((id#0L = id#4L))
      :  :- Join LeftSemi, Some((id#0L = id#12L))
      :  :  :- Range 0, 10, 1, 8, [id#0L]
      :  :  +- Range 3, 12, 1, 8, [id#12L]
      :  +- Range 5, 15, 1, 8, [id#4L]
      +- Range 7, 25, 1, 8, [id#8L]
      
      == Physical Plan ==
      ...
      ```
      I have also taken the opportunity to move quite a bit of code around:
      - Rewriting subqueries and pulling out correlated predicates from subqueries have been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do.
      - I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child).
      - Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution.
      
      These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere.
      
      Current tests and added test cases in FilterPushdownSuite.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12720 from hvanhovell/SPARK-14858.
    • [SPARK-14646][ML] Modified Kmeans to store cluster centers with one per row · 1eda2f10
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Modified Kmeans to store cluster centers with one per row
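
      A hedged sketch of the layout change (the case class name and schema are assumptions, not the actual persistence code):
      ```scala
      // One row per cluster center keeps individual rows small, instead of
      // packing every center into a single wide row.
      case class ClusterData(clusterIdx: Int, clusterCenter: Array[Double])

      val centers = Seq(Array(0.0, 1.0), Array(2.0, 3.0)) // stand-in centers
      val rows = centers.zipWithIndex.map { case (c, i) => ClusterData(i, c) }
      // spark.createDataFrame(rows).write.parquet(dataPath)
      ```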
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12792 from jkbradley/kmeans-save-fix.
    • [SPARK-14988][PYTHON] SparkSession API follow-ups · d33e3d57
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Addresses comments in #12765.
      
      ## How was this patch tested?
      
      Python tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12784 from andrewor14/python-followup.
    • [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. · 4ae9fe09
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
      
      The function signature is:
      
      	dapply(df, function(localDF) {}, schema = NULL)
      
      R function input: local data.frame from the partition on local node
      R function output: local data.frame
      
      Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
      If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a DataFrame can still be processed by successive calls to dapply().
      
      ## How was this patch tested?
      SparkR unit tests.
      
      Author: Sun Rui <rui.sun@intel.com>
      Author: Sun Rui <sunrui2016@gmail.com>
      
      Closes #12493 from sun-rui/SPARK-12919.
    • [SPARK-14570][ML] Log instrumentation in Random forests · d78fbcc3
      BenFradet authored
      ## What changes were proposed in this pull request?
      
      Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor}
      
      ## How was this patch tested?
      
      No tests involved since it's logging related.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #12536 from BenFradet/SPARK-14570.
    • [SPARK-15013][SQL] Remove hiveConf from HiveSessionState · af32f4ae
      Yin Huai authored
      ## What changes were proposed in this pull request?
      The hiveConf in HiveSessionState is not actually used anymore. Let's remove it.
      
      ## How was this patch tested?
      Existing tests
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12786 from yhuai/removeHiveConf.
    • [SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns · a04b1de5
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column.
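
      A hedged illustration, assuming this concerns the `SORTED BY` clause of bucketed-table DDL and a SparkSession named `spark` (the table and columns are made up):
      ```scala
      // Previously the DESC below was silently dropped; after this change the
      // statement fails with an exception instead.
      spark.sql(
        """CREATE TABLE t (a INT, b INT)
          |CLUSTERED BY (a) SORTED BY (b DESC) INTO 4 BUCKETS""".stripMargin)
      ```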
      
      ## How was this patch tested?
      
      A test case is added to test the invalid sorting order by checking exception message.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12759 from liancheng/spark-14981.
    • [SPARK-15004][SQL] Remove zookeeper service discovery code in thrift-server · 8ebae466
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We recently inlined Hive's thrift server code in SPARK-14987. This patch removes the code related to ZooKeeper service discovery, Tez, and Hive on Spark, since they are irrelevant.
      
      ## How was this patch tested?
      N/A - removing dead code
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12780 from rxin/SPARK-15004.
    • [SPARK-15011][SQL][TEST] Ignore org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelation · ac115f66
      Yin Huai authored
      This test always fails with sbt's Hadoop 2.3 and 2.4 tests. Let's disable it for now and investigate the problem.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12783 from yhuai/SPARK-15011-ignore.
    • [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2 · 775772de
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      pyspark.ml API for LDA
      * LDA, LDAModel, LocalLDAModel, DistributedLDAModel
      * includes persistence
      
      This replaces [https://github.com/apache/spark/pull/10242]
      
      ## How was this patch tested?
      
      * doc test for LDA, including Param setters
      * unit test for persistence
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #12723 from jkbradley/zjffdu-SPARK-11940.
    • [SPARK-14984][ML] Deprecated model field in LinearRegressionSummary · f08dcdb8
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Deprecated model field in LinearRegressionSummary
      
      Removed unnecessary Since annotations
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12763 from jkbradley/lr-summary-api.
    • [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans) · 87ac84d4
      Yanbo Liang authored
      SparkR ```glm``` and ```kmeans``` model persistence.
      
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Gayathri Murali <gayathri.m.softie@gmail.com>
      
      Closes #12778 from yanboliang/spark-14311.
      Closes #12680
      Closes #12683
    • [SPARK-14988][PYTHON] SparkSession catalog and conf API · a7d0fedc
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.
      
      ## How was this patch tested?
      
      Python tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12765 from andrewor14/python-spark-session-more.
    • [SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver · 7feeb82c
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR copies the thrift-server code from hive-service-1.2 (including TCLIService.thrift and the generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12764 from davies/thrift_server.
    • [SPARK-14571][ML] Log instrumentation in ALS · b6fa7e59
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add log instrumentation for parameters:
      rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha,
      userCol, itemCol, ratingCol, predictionCol, maxIter,
      regParam, nonnegative, checkpointInterval, seed
      
      Add log instrumentation for numUserFeatures and numItemFeatures
      
      ## How was this patch tested?
      
      Manual test: set a breakpoint in IntelliJ and run def testALS(); single-step through and check that the log method is called.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12560 from wangmiao1981/log.
    • [SPARK-14969][MLLIB] Remove duplicate implementation of compute in LogisticGradient · 6d5aeaae
      dding3 authored
      ## What changes were proposed in this pull request?
      
      This PR removes the duplicate implementation of compute in the LogisticGradient class.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>
      
      Closes #12747 from dding3/master.
    • [SPARK-14511][BUILD] Upgrade genjavadoc to latest upstream · 7226e190
      Jakob Odersky authored
      ## What changes were proposed in this pull request?
      In the past, genjavadoc had issues with package-private members, which led the Spark project to use a forked version. This issue has been fixed upstream (typesafehub/genjavadoc#70) and a release is available for Scala versions 2.10, 2.11 **and 2.12**, hence a forked version for Spark is no longer necessary.
      This pull request updates the build configuration to use the newest upstream genjavadoc.
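
      A hypothetical sbt fragment showing the shape of such a change (the coordinates follow genjavadoc's documented usage; the version number is illustrative):
      ```scala
      // Pull the compiler plugin from upstream instead of the forked artifact.
      libraryDependencies += compilerPlugin(
        "com.typesafe.genjavadoc" % "genjavadoc-plugin" % "0.10" cross CrossVersion.full)
      scalacOptions += "-P:genjavadoc:out=target/java" // emit Java stubs for javadoc
      ```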
      
      ## How was this patch tested?
      The build was run with `sbt unidoc`. During the process javadoc emits some errors on the generated Java stubs; however, these errors were also present before the upgrade. Furthermore, the produced HTML is fine.
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #12707 from jodersky/SPARK-14511-genjavadoc.
    • [SPARK-14994][SQL] Remove execution hive from HiveSessionState · 054f991c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes executionHive from HiveSessionState and HiveSharedState.
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12770 from rxin/SPARK-14994.
    • [SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL · 2057cbcb
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12771 from sameeragarwal/tpcds-2.
    • [SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join · 222dcf79
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
      ```SQL
        SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
        ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
      ```
       Note:
       1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
       2. This rule has to be done after de-duplicating the attributes; otherwise, the generated
          join conditions will be incorrect.
      
      This PR also corrects the existing behavior in Spark. Before this PR, the behavior was:
      ```scala
        test("except") {
          val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
          val df_right = Seq(1, 3).toDF("id")
      
          checkAnswer(
            df_left.except(df_right),
            Row(2) :: Row(2) :: Row(4) :: Nil
          )
        }
      ```
      After this PR, the result is corrected. We now strictly follow the SQL semantics of `EXCEPT DISTINCT`.
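
      For contrast, a sketch of the corrected expectation (not the actual test code); with duplicates removed, the same check becomes:
      ```scala
      checkAnswer(
        df_left.except(df_right),
        Row(2) :: Row(4) :: Nil
      )
      ```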
      
      #### How was this patch tested?
      Modified and added a few test cases to verify the optimization rule and the results of operators.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12736 from gatorsmile/exceptByAntiJoin.
    • [SPARK-14886][MLLIB] RankingMetrics.ndcgAt throws java.lang.ArrayIndexOutOfBoundsException · d1cf3201
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Handle the case where the number of predictions is less than the label set size or k in the nDCG computation.
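
      A hedged sketch of the guard (not the actual RankingMetrics code): cap the loop bound by the number of available predictions so indexing never runs past the array.
      ```scala
      def dcgAt(pred: Array[Int], labels: Set[Int], k: Int): Double = {
        // Previously a bound of k alone could exceed pred.length and throw
        // ArrayIndexOutOfBoundsException.
        val n = math.min(pred.length, k)
        (0 until n).map { i =>
          if (labels.contains(pred(i))) 1.0 / (math.log(i + 2) / math.log(2)) else 0.0
        }.sum
      }
      ```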
      
      ## How was this patch tested?
      
      New unit test; existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12756 from srowen/SPARK-14886.
    • [MINOR][DOC] Minor typo fixes · 4f83e442
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Minor typo fixes
      
      ## How was this patch tested?
      local build
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12755 from zhengruifeng/fix_doc_dataset.
    • [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD · cabd54d9
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      According to [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate the SGD-based APIs of LogisticRegression and LinearRegression.
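
      Illustrative only, the shape of such a deprecation (the exact message and version string are assumptions):
      ```scala
      @deprecated("Use ml.classification.LogisticRegression or " +
        "mllib.classification.LogisticRegressionWithLBFGS instead", "2.0.0")
      class LogisticRegressionWithSGD // stand-in for the real class
      ```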
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12596 from zhengruifeng/deprecate_sgd.
    • [SPARK-7264][ML] Parallel lapply for sparkR · 769a909d
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function implements a distributed version of `lapply` using Spark as a backend.
      
      TODO:
       - [x] check documentation
       - [ ] check tests
      
      Trivial example in SparkR:
      
      ```R
      sparkLapply(1:5, function(x) { 2 * x })
      ```
      
      Output:
      
      ```
      [[1]]
      [1] 2
      
      [[2]]
      [1] 4
      
      [[3]]
      [1] 6
      
      [[4]]
      [1] 8
      
      [[5]]
      [1] 10
      ```
      
      Here is a slightly more complex example to perform distributed training of multiple models. Under the hood, Spark broadcasts the dataset.
      
      ```R
      library("MASS")
      data(menarche)
      families <- c("gaussian", "poisson")
      train <- function(family) { glm(Menarche ~ Age, family = family, data = menarche) }
      results <- sparkLapply(families, train)
      ```
      
      ## How was this patch tested?
      
      This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.
      
      cc falaki davies
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #12426 from thunterdb/7264.
  2. Apr 28, 2016
    • [SPARK-14991][SQL] Remove HiveNativeCommand · 4607f6e7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.
      
      ## How was this patch tested?
      Updated tests to reflect this.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12769 from rxin/SPARK-14991.
    • [HOTFIX][CORE] Fix a concurrency issue in NewAccumulator · 6f9a18fe
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `AccumulatorContext` is not thread-safe; that's why all of its methods are synchronized. However, there is one exception: `AccumulatorContext.originals`. `NewAccumulator` uses it to check whether it's registered, which is wrong because that access is not synchronized.

      This PR marks `AccumulatorContext.originals` as `private`, so now all access to `AccumulatorContext` is synchronized.
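
      A hedged sketch of the resulting shape (method names are illustrative, not the actual API):
      ```scala
      object AccumulatorContext {
        // The map is no longer reachable from outside; every access goes
        // through a synchronized member.
        private val originals = new java.util.HashMap[Long, AnyRef]
        def register(id: Long, acc: AnyRef): Unit = synchronized { originals.put(id, acc) }
        def contains(id: Long): Boolean = synchronized { originals.containsKey(id) }
      }
      ```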
      
      ## How was this patch tested?
      
      I verified it locally. To be safe, we can let Jenkins test it many times to make sure this problem is gone.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12773 from cloud-fan/debug.
    • [SPARK-14836][YARN] Zip all the jars before uploading to distributed cache · 2398e3d6
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      (copied from the JIRA)

      Currently, if neither `spark.yarn.jars` nor `spark.yarn.archive` is set (the default), Spark on YARN uploads all the jars in the folder into the distributed cache one by one, which is quite time consuming and very verbose. This change instead zips all the jars into a single archive first and puts that archive into the distributed cache.

      This significantly improves application startup time.
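
      A conceptual sketch of the zipping step (not the actual YARN `Client` code):
      ```scala
      import java.io.{File, FileOutputStream}
      import java.util.zip.{ZipEntry, ZipOutputStream}

      // Zip every jar under jarsDir into one archive, so a single file (rather
      // than hundreds) is uploaded to the distributed cache.
      def zipJars(jarsDir: File, archive: File): File = {
        val zos = new ZipOutputStream(new FileOutputStream(archive))
        try {
          for (jar <- jarsDir.listFiles().filter(_.getName.endsWith(".jar"))) {
            zos.putNextEntry(new ZipEntry(jar.getName))
            java.nio.file.Files.copy(jar.toPath, zos)
            zos.closeEntry()
          }
        } finally {
          zos.close()
        }
        archive
      }
      ```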
      
      ## How was this patch tested?
      
      Unit tests and a local integration test were done.

      Verified with SparkPi in both cluster and client mode.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #12597 from jerryshao/SPARK-14836.
    • [SPARK-14862][ML] Updated Classifiers to not require labelCol metadata · 4f4721a2
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata.
      * They first check for metadata.
      * If numClasses is not specified in metadata, they identify the largest label value (up to a limit).
      
      This functionality is implemented in a new Classifier.getNumClasses method (sketched below).
      
      Also
      * Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking.
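
      A hedged sketch of the fallback logic (the real Classifier.getNumClasses signature may differ):
      ```scala
      // Without numClasses in the label column's metadata, infer it from the
      // largest label value seen, capped to guard against unreasonable inputs.
      def getNumClasses(labels: Seq[Double], maxNumClasses: Int = 100): Int = {
        val maxLabel = labels.max
        require(maxLabel == maxLabel.toInt && maxLabel < maxNumClasses,
          s"Labels must be integer class indices in [0, $maxNumClasses); found $maxLabel")
        maxLabel.toInt + 1 // labels are 0-based class indices
      }
      ```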
      
      ## How was this patch tested?
      
      * Unit tests in ClassifierSuite for helper methods
      * Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12663 from jkbradley/trees-no-metadata.
    • [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local · dae538a4
      Pravin Gadakh authored
      ## What changes were proposed in this pull request?
      
      This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.
      
      ## How was this patch tested?
      
      Scala-style checks passed.
      
      Author: Pravin Gadakh <prgadakh@in.ibm.com>
      
      Closes #12416 from pravingadakh/SPARK-14613.
    • [SPARK-14555] Second cut of Python API for Structured Streaming · 78c8aaf8
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This PR adds Python APIs for:
       - `ContinuousQueryManager`
       - `ContinuousQueryException`
      
      The `ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`.
      
      For `ContinuousQueryManager`, all APIs are provided except for registering listeners.
      
      This PR also attempts to fix test flakiness by stopping all active streams just before tests.
      
      ## How was this patch tested?
      
      Python Doc tests and unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12673 from brkyvz/pyspark-cqm.
    • [SPARK-12810][PYSPARK] PySpark CrossValidatorModel should support avgMetrics · d584a2b8
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      support avgMetrics in CrossValidatorModel with Python
      ## How was this patch tested?
      Doctest and `test_save_load` in `pyspark/ml/test.py`
      [JIRA](https://issues.apache.org/jira/browse/SPARK-12810)
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #12464 from vectorijk/spark-12810.
    • [SPARK-14970][SQL] Prevent DataSource from enumerating all files in a directory if there is a user-specified schema · 0ee5419b
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      The FileCatalog object gets created even if the user specifies a schema, which means the files in the directory are enumerated even though it's not necessary. For large directories this is very slow. Users want to specify a schema precisely in such large-directory scenarios, so this defeats the purpose.
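
      A hedged illustration of the user-facing effect, assuming a SparkSession named `spark` (the path and fields are made up):
      ```scala
      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      // With an explicit schema there is nothing to infer, so the directory no
      // longer needs to be fully enumerated before the query runs.
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("name", StringType)))

      val df = spark.read.schema(schema).json("/data/huge-directory")
      ```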
      
      ## How was this patch tested?
      Hard to test this with a unit test.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12748 from tdas/SPARK-14970.
    • [SPARK-14916][MLLIB] A more friendly toString for FreqItemset in mllib.fpm · d5ab42ce
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-14916
      FreqItemset, as the result of FPGrowth, should have a friendlier toString() to help users and developers.
      Sample output:
      {a, b}: 5
      {x, y, z}: 4
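
      A hedged sketch of a rendering that matches the sample above (the real FreqItemset.toString may differ in detail):
      ```scala
      class FreqItemsetLike(val items: Array[String], val freq: Long) {
        // Items in braces, followed by the frequency.
        override def toString: String =
          s"${items.mkString("{", ", ", "}")}: $freq"
      }

      println(new FreqItemsetLike(Array("a", "b"), 5L)) // prints: {a, b}: 5
      ```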
      
      ## How was this patch tested?
      
      existing unit tests.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #12698 from hhbyyh/freqtos.
    • [SPARK-14852][ML] refactored GLM summary into training, non-training summaries · 5ee72454
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      This splits GeneralizedLinearRegressionSummary into two summary types:
      * GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA)
      * GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting
      
      This also adds an evaluate() method which can produce a GeneralizedLinearRegressionSummary on a new dataset (see the sketch after this list).
      
      The summary no longer provides the model itself as a public val.
      
      Also:
      * Fixes a bug where GeneralizedLinearRegressionTrainingSummary was created with the model, not the summaryModel.
      * Adds hasSummary method.
      * Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method.
      * In the summary, extract values from the model immediately in case the user later changes them (e.g., predictionCol).
      * Pardon the style fixes; that is IntelliJ being obnoxious.
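
      A hedged sketch of the resulting API shape (assumes DataFrames `training` and `test` with "features"/"label" columns; hasSummary, summary, and evaluate follow the description above, while `deviance` is assumed for illustration):
      ```scala
      import org.apache.spark.ml.regression.GeneralizedLinearRegression

      val glr = new GeneralizedLinearRegression().setFamily("gaussian")
      val model = glr.fit(training)
      if (model.hasSummary) {
        // Training summary: includes information from fitting.
        println(model.summary.deviance)
      }
      // Plain summary computed on new data, without fitting information.
      val testSummary = model.evaluate(test)
      ```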
      
      ## How was this patch tested?
      
      Existing unit tests + updated test for evaluate and hasSummary
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12624 from jkbradley/model-summary-api.
    • [SPARK-14965][SQL] Indicate an exception is thrown for a missing struct field · 12c360c0
      Gregory Hart authored
      ## What changes were proposed in this pull request?
      
      Fix to ScalaDoc for StructType.
      
      ## How was this patch tested?
      
      Built locally.
      
      Author: Gregory Hart <greg.hart@thinkbiganalytics.com>
      
      Closes #12758 from freastro/hotfix/SPARK-14965.