  1. Apr 30, 2016
    • [SPARK-13973][PYSPARK] Make pyspark fail noisily if IPYTHON or IPYTHON_OPTS are set · 0368ff30
      pshearer authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13973
      
      Following discussion with srowen, the IPYTHON and IPYTHON_OPTS variables are removed. If either is set in the user's environment, pyspark will not execute and prints an error message instead. Failing noisily will force users to remove these options and learn the new configuration scheme (PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS), which is much more sustainable and less confusing.
      
      ## How was this patch tested?
      
      Manual testing; set IPYTHON=1 and verified that the error message prints.
      
      Author: pshearer <pshearer@massmutual.com>
      Author: shearerp <shearerp@umich.edu>
      
      Closes #12528 from shearerp/master.
    • [SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs · 8dc3987d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes some code that is no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12806 from rxin/SPARK-15028.
    • [SPARK-14831][.2][ML][R] rename ml.save/ml.load to write.ml/read.ml · b3ea5793
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Continue the work of #12789 to rename ml.save/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR.
      
      I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      cc: yanboliang thunterdb
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12807 from mengxr/SPARK-14831.
    • [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS · 7fbe1bb2
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      As discussed in #12660, this PR renames
      * intermediateRDDStorageLevel -> intermediateStorageLevel
      * finalRDDStorageLevel -> finalStorageLevel
      
      The argument name in `ALS.train` will be addressed in SPARK-15027.
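
      For reference, a minimal hedged sketch of the renamed setters (assuming the Spark 2.0 `ml.ALS` API, where storage levels are passed by name):

      ```scala
      import org.apache.spark.ml.recommendation.ALS

      // Renamed params; the values are StorageLevel names passed as strings.
      val als = new ALS()
        .setIntermediateStorageLevel("MEMORY_AND_DISK") // was setIntermediateRDDStorageLevel
        .setFinalStorageLevel("MEMORY_AND_DISK")        // was setFinalRDDStorageLevel
      ```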
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12803 from mengxr/SPARK-14412.
    • [SPARK-14533][MLLIB] RowMatrix.computeCovariance inaccurate when values are very large (partial fix) · 5886b621
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Fix for part of SPARK-14533: trivial simplification and more accurate computation of column means. See also https://github.com/apache/spark/pull/12299 which contained a complete fix that was very slow. This PR does _not_ resolve SPARK-14533 entirely.
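
      To illustrate the idea, here is a hedged plain-Scala sketch (not the actual `RowMatrix` code): computing covariance around the column means avoids the catastrophic cancellation that the naive `E[xy] - E[x]E[y]` form suffers when values are very large.

      ```scala
      // Covariance of two columns computed around their means; summing the
      // centered products stays accurate even for very large raw values.
      def cov(xs: Array[Double], ys: Array[Double]): Double = {
        val n  = xs.length
        val mx = xs.sum / n
        val my = ys.sum / n
        xs.indices.map(i => (xs(i) - mx) * (ys(i) - my)).sum / (n - 1)
      }
      ```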
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12779 from srowen/SPARK-14533.2.
    • [MINOR][EXAMPLE] Use SparkSession instead of SQLContext in RDDRelation.scala · f86f7176
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `SQLContext` is now kept only for backward compatibility, so the Spark 2.0 examples should use `SparkSession` instead.
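
      A hedged sketch of the pattern the example moves to (standard Spark 2.0 builder API):

      ```scala
      import org.apache.spark.sql.SparkSession

      // Spark 2.0 entry point, replacing `new SQLContext(sc)` in the example.
      val spark = SparkSession.builder()
        .appName("RDDRelation")
        .getOrCreate()

      import spark.implicits._ // toDF(), $-columns, etc.
      ```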
      
      ## How was this patch tested?
      
      This is just an example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12808 from dongjoon-hyun/rddrelation.
    • [SPARK-14850][.2][ML] use UnsafeArrayData.fromPrimitiveArray in ml.VectorUDT/MatrixUDT · 3d09ceee
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR uses `UnsafeArrayData.fromPrimitiveArray` to implement `ml.VectorUDT/MatrixUDT` to avoid boxing/unboxing.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      cc: cloud-fan
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12805 from mengxr/SPARK-14850.
    • [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2. · 73c20bf3
      Marcelo Vanzin authored
      There's actually a race here: the state of the handler was changed before
      the connection was set, so the test code could be notified of the state
      change, wake up, and still see the connection as null, triggering the assert.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12785 from vanzin/SPARK-14391.
    • [SPARK-14831][SPARKR] Make the SparkR MLlib API more consistent with Spark · bc36fe6e
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This PR splits the MLlib algorithms into two flavors:
       - the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes)
       - the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages
      
      In practice, the former calls the latter.
      
      ## How was this patch tested?
      
      The tests for the various algorithms were adapted to be run against both interfaces.
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #12789 from thunterdb/14831.
    • [SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT · 43b149fb
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.
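
      A hedged sketch of the new round-trip (an internal catalyst API; the package path is assumed):

      ```scala
      import org.apache.spark.sql.catalyst.util.UnsafeArrayData

      // Convert a primitive double array to the unsafe format and back
      // without per-element boxing/unboxing.
      val unsafe = UnsafeArrayData.fromPrimitiveArray(Array(1.0, 2.0, 3.0))
      val back: Array[Double] = unsafe.toDoubleArray()
      ```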
      
      ## How was this patch tested?
      
      Existing tests and a new test suite, `UnsafeArraySuite`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12640 from cloud-fan/ml.
    • [SPARK-13667][SQL] Support for specifying custom date format for date and timestamp types at CSV datasource · 4bac703e
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR adds the support to specify custom date format for `DateType` and `TimestampType`.
      
      For `TimestampType`, this uses the given format both to infer the schema and to convert the values.
      For `DateType`, this uses the given format to convert the values.
      If `dateFormat` is not given, it falls back to `DateTimeUtils.stringToTime()` for backwards compatibility.
      When it is given, `SimpleDateFormat` is used for parsing.
      
      In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedences.
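
      Putting this together, a hedged usage sketch (the `dateFormat` option is the one added here; `spark`, the path, and the file contents are assumptions):

      ```scala
      // Read a CSV whose date/timestamp column uses a custom pattern.
      val df = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dateFormat", "yyyy/MM/dd HH:mm") // parsed with SimpleDateFormat
        .csv("/tmp/events.csv")                   // hypothetical file
      ```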
      
      In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema`, matching the JSON datasource's `json.InferSchema`. Although they share a name, the parent package names still differentiate them. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`.
      
      ## How was this patch tested?
      
      Unit tests, plus `./dev/run_tests` for coding style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11550 from HyukjinKwon/SPARK-13667.
    • [SPARK-14591][SQL] Remove DataTypeParser and add more keywords to the nonReserved list. · ac41fc64
      Yin Huai authored
      ## What changes were proposed in this pull request?
      CatalystSqlParser can already parse data types, so we do not need a separate DataTypeParser.
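
      A hedged sketch of the consolidated path (assuming `CatalystSqlParser.parseDataType` as the entry point):

      ```scala
      import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

      // The general SQL parser handles data-type strings directly.
      val dt = CatalystSqlParser.parseDataType("array<struct<a:int,b:string>>")
      println(dt.simpleString) // array<struct<a:int,b:string>>
      ```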
      
      ## How was this patch tested?
      Existing tests
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12796 from yhuai/removeDataTypeParser.
    • [SPARK-14757] [SQL] Fix nullability bug in EqualNullSafe codegen · 7945f9f6
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch fixes a null handling bug in EqualNullSafe's code generation.
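
      For context, a hedged illustration of the null-safe equality semantics (`<=>`) whose code generation this patch fixes (assuming a `spark` session):

      ```scala
      // Plain `=` returns NULL when either side is NULL; `<=>` never returns NULL.
      spark.sql("SELECT null <=> null AS nullsafe, null = null AS plain").show()
      // nullsafe: true, plain: null
      ```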
      
      ## How was this patch tested?
      Updated unit tests so they would fail without the fix.
      
      Closes #12628.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Arash Nabili <arash@levyx.com>
      
      Closes #12799 from rxin/equalnullsafe.
    • [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS · 90fa2c6e
      Nick Pentreath authored
      `mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.
      
      ## How was this patch tested?
      
      New test cases in `ALSSuite` and `tests.py`.
      
      cc yanboliang jkbradley sethah rishabhbhardwaj
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #12660 from MLnick/SPARK-14412-als-storage-params.
  2. Apr 29, 2016
    • [SPARK-14917][SQL] Enable some ORC compression tests for writing · d7755cfd
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14917
      
      As described in the JIRA, Hive 1.2.1, which Spark currently uses, supports the snappy and none compression codecs.

      So, this PR enables some tests for writing ORC files with the compression codecs `SNAPPY` and `NONE`.
      
      ## How was this patch tested?
      
      Unit tests in `OrcQuerySuite` and `sbt scalastyle`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12699 from HyukjinKwon/SPARK-14917.
    • [SPARK-13786][ML][PYTHON] Removed save/load for python tuning · 09da43d5
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily.  This support should be re-designed and added in the next release.
      
      ## How was this patch tested?
      
      Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12782 from jkbradley/remove-python-tuning-saveload.
    • [SPARK-15012][SQL] Simplify configuration API further · 66773eb8
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      1. Remove all the `spark.setConf` etc.; just expose `spark.conf`.
      2. Make `spark.conf` take in things set in the core `SparkConf` as well; otherwise users may get confused.
      
      This was done for both the Python and Scala APIs.
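
      A hedged sketch of the resulting surface (assuming the `RuntimeConfig`-style API exposed as `spark.conf`):

      ```scala
      // All configuration goes through spark.conf; spark.setConf and friends are gone.
      spark.conf.set("spark.sql.shuffle.partitions", "50")
      val n = spark.conf.get("spark.sql.shuffle.partitions")
      // Values set in the core SparkConf are visible here as well.
      val master = spark.conf.get("spark.master")
      ```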
      
      ## How was this patch tested?
      `SQLConfSuite`, python tests.
      
      This one fixes the failed tests in #12787
      
      Closes #12787
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12798 from yhuai/conf-api.
    • [SPARK-15010][CORE] new accumulator should be tolerant of local RPC message delivery · b056e8cb
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The RPC framework does not serialize or deserialize messages in local mode, so we should not call `acc.value` when receiving a heartbeat message: the serialization hook of the new accumulator may not have been triggered, and the `atDriverSide` flag may not be set.
      
      ## How was this patch tested?
      
      Tested locally via spark shell.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12795 from cloud-fan/bug.
    • [SPARK-15019][SQL] Propagate all Spark Confs to HiveConf created in HiveClientImpl · b33d6b72
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR makes two changes:
      1. We will propagate Spark confs to the HiveConf created in HiveClientImpl, so users can also use Spark confs to set the warehouse location and the metastore URL.
      2. In sql/hive, HiveClientImpl will be the only place where we create a new HiveConf.
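
      A hedged example of what change 1 enables (the conf keys are illustrative of the pattern, not taken from this PR):

      ```scala
      import org.apache.spark.sql.SparkSession

      // Hive-related settings passed as ordinary Spark confs.
      val spark = SparkSession.builder()
        .config("spark.sql.warehouse.dir", "/tmp/warehouse")      // warehouse location
        .config("hive.metastore.uris", "thrift://localhost:9083") // metastore URL
        .enableHiveSupport()
        .getOrCreate()
      ```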
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12791 from yhuai/onlyUseHiveConfInHiveClientImpl.
    • [SPARK-15003] Use ConcurrentHashMap in place of HashMap for NewAccumulator.originals · dcfaeade
      tedyu authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to use ConcurrentHashMap in place of HashMap for NewAccumulator.originals
      
      This should result in better performance.
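
      A hedged sketch of the pattern (types simplified; not Spark's actual accumulator code):

      ```scala
      import java.util.concurrent.ConcurrentHashMap

      // A concurrent map supports registration and lookup from multiple
      // threads without external synchronization.
      val originals = new ConcurrentHashMap[Long, String]()
      originals.putIfAbsent(1L, "accumulator-1")
      val acc = originals.get(1L)
      ```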
      
      ## How was this patch tested?
      
      Existing unit test suite
      
      cloud-fan
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #12776 from tedyu/master.
    • [SPARK-14858] [SQL] Enable subquery pushdown · 83061be6
      Herman van Hovell authored
      The previous subquery PRs did not include support for pushing down subqueries used in filters (`WHERE`/`HAVING`). This PR adds that support. For example:
      ```scala
      range(0, 10).registerTempTable("a")
      range(5, 15).registerTempTable("b")
      range(7, 25).registerTempTable("c")
      range(3, 12).registerTempTable("d")
      val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)")
      plan.explain(true)
      ```
      Leads to the following Analyzed & Optimized plans:
      ```
      == Parsed Logical Plan ==
      ...
      
      == Analyzed Logical Plan ==
      id: bigint, id: bigint, id: bigint
      Project [id#0L,id#4L,id#8L]
      +- Filter predicate-subquery#16 [(id#0L = id#12L)]
         :  +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)]
         :     +- Project [id#12L]
         :        +- SubqueryAlias d
         :           +- Range 3, 12, 1, 8, [id#12L]
         +- Join LeftOuter, Some((id#8L = id#4L))
            :- Join Inner, Some((id#0L = id#4L))
            :  :- SubqueryAlias a
            :  :  +- Range 0, 10, 1, 8, [id#0L]
            :  +- SubqueryAlias b
            :     +- Range 5, 15, 1, 8, [id#4L]
            +- SubqueryAlias c
               +- Range 7, 25, 1, 8, [id#8L]
      
      == Optimized Logical Plan ==
      Join LeftOuter, Some((id#8L = id#4L))
      :- Join Inner, Some((id#0L = id#4L))
      :  :- Join LeftSemi, Some((id#0L = id#12L))
      :  :  :- Range 0, 10, 1, 8, [id#0L]
      :  :  +- Range 3, 12, 1, 8, [id#12L]
      :  +- Range 5, 15, 1, 8, [id#4L]
      +- Range 7, 25, 1, 8, [id#8L]
      
      == Physical Plan ==
      ...
      ```
      I have also taken the opportunity to move quite a bit of code around:
      - Rewriting subqueries and pulling correlated predicates out of subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do.
      - I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child).
      - Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution.
      
      These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere.
      
      Current tests and added test cases in FilterPushdownSuite.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12720 from hvanhovell/SPARK-14858.
    • [SPARK-14646][ML] Modified Kmeans to store cluster centers with one per row · 1eda2f10
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Modified Kmeans to store cluster centers with one per row
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12792 from jkbradley/kmeans-save-fix.
    • [SPARK-14988][PYTHON] SparkSession API follow-ups · d33e3d57
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Addresses comments in #12765.
      
      ## How was this patch tested?
      
      Python tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12784 from andrewor14/python-followup.
    • [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. · 4ae9fe09
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
      
      The function signature is:
      
      	dapply(df, function(localDF) {}, schema = NULL)
      
      R function input: local data.frame from the partition on local node
      R function output: local data.frame
      
      The schema specifies the row format of the resulting DataFrame and must match the R function's output.
      If the schema is not specified, each partition of the resulting DataFrame is serialized in R into a single byte array; such a DataFrame can still be processed by successive dapply() calls.
      
      ## How was this patch tested?
      SparkR unit tests.
      
      Author: Sun Rui <rui.sun@intel.com>
      Author: Sun Rui <sunrui2016@gmail.com>
      
      Closes #12493 from sun-rui/SPARK-12919.
    • [SPARK-14570][ML] Log instrumentation in Random forests · d78fbcc3
      BenFradet authored
      ## What changes were proposed in this pull request?
      
      Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor}
      
      ## How was this patch tested?
      
      No tests involved, since this is logging-related.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #12536 from BenFradet/SPARK-14570.
    • [SPARK-15013][SQL] Remove hiveConf from HiveSessionState · af32f4ae
      Yin Huai authored
      ## What changes were proposed in this pull request?
      The hiveConf in HiveSessionState is not actually used anymore. Let's remove it.
      
      ## How was this patch tested?
      Existing tests
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12786 from yhuai/removeHiveConf.
    • [SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns · a04b1de5
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops the sorting direction. This PR fixes this by throwing an exception if `DESC` is specified as the sorting direction of a sorting column.
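
      A hedged example of the kind of statement affected (assuming the bucketing `SORTED BY` clause; the table is made up):

      ```scala
      // Before this patch the DESC below was silently dropped; now it throws.
      spark.sql("""
        CREATE TABLE t(a INT, b STRING)
        CLUSTERED BY (a) SORTED BY (b DESC) INTO 4 BUCKETS
      """)
      ```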
      
      ## How was this patch tested?
      
      A test case is added that triggers the invalid sorting order and checks the exception message.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12759 from liancheng/spark-14981.
    • [SPARK-15004][SQL] Remove zookeeper service discovery code in thrift-server · 8ebae466
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We recently inlined Hive's thrift server code in SPARK-14987. This patch removes the code related to zookeeper service discovery, Tez, and Hive on Spark, since they are irrelevant.
      
      ## How was this patch tested?
      N/A - removing dead code
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12780 from rxin/SPARK-15004.
    • [SPARK-15011][SQL][TEST] Ignore org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelation · ac115f66
      Yin Huai authored
      This test always fails with sbt's hadoop 2.3 and 2.4 tests. Let's disable it for now and investigate the problem.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12783 from yhuai/SPARK-15011-ignore.
    • [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2 · 775772de
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      pyspark.ml API for LDA
      * LDA, LDAModel, LocalLDAModel, DistributedLDAModel
      * includes persistence
      
      This replaces [https://github.com/apache/spark/pull/10242]
      
      ## How was this patch tested?
      
      * doc test for LDA, including Param setters
      * unit test for persistence
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #12723 from jkbradley/zjffdu-SPARK-11940.
    • [SPARK-14984][ML] Deprecated model field in LinearRegressionSummary · f08dcdb8
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Deprecated model field in LinearRegressionSummary
      
      Removed unnecessary Since annotations
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12763 from jkbradley/lr-summary-api.
    • [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans) · 87ac84d4
      Yanbo Liang authored
      SparkR ```glm``` and ```kmeans``` model persistence.
      
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Gayathri Murali <gayathri.m.softie@gmail.com>
      
      Closes #12778 from yanboliang/spark-14311.
      Closes #12680
      Closes #12683
    • [SPARK-14988][PYTHON] SparkSession catalog and conf API · a7d0fedc
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.
      
      ## How was this patch tested?
      
      Python tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12765 from andrewor14/python-spark-session-more.
    • [SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver · 7feeb82c
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR copies the thrift server from hive-service-1.2 (including TCLIService.thrift and the generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12764 from davies/thrift_server.
    • [SPARK-14571][ML] Log instrumentation in ALS · b6fa7e59
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add log instrumentation for parameters:
      rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha,
      userCol, itemCol, ratingCol, predictionCol, maxIter,
      regParam, nonnegative, checkpointInterval, seed
      
      Add log instrumentation for numUserFeatures and numItemFeatures
      
      ## How was this patch tested?
      
      Manual test: set a breakpoint in IntelliJ, run `testALS()`, and single-step through to check that the log method is called.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12560 from wangmiao1981/log.
    • [SPARK-14969][MLLIB] Remove duplicate implementation of compute in LogisticGradient · 6d5aeaae
      dding3 authored
      ## What changes were proposed in this pull request?
      
      This PR removes a duplicate implementation of `compute` in the LogisticGradient class.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>
      
      Closes #12747 from dding3/master.
    • [SPARK-14511][BUILD] Upgrade genjavadoc to latest upstream · 7226e190
      Jakob Odersky authored
      ## What changes were proposed in this pull request?
      In the past, genjavadoc had issues with package-private members, which led the Spark project to use a forked version. This issue has been fixed upstream (typesafehub/genjavadoc#70) and a release is available for Scala versions 2.10, 2.11 **and 2.12**, hence a forked version for Spark is no longer necessary.
      This pull request updates the build configuration to use the newest upstream genjavadoc.
      
      ## How was this patch tested?
      The build was run with `sbt unidoc`. During the process, javadoc emits some errors on the generated Java stubs; however, these errors were also present before the upgrade. Furthermore, the produced HTML is fine.
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #12707 from jodersky/SPARK-14511-genjavadoc.
    • [SPARK-14994][SQL] Remove execution hive from HiveSessionState · 054f991c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes executionHive from HiveSessionState and HiveSharedState.
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12770 from rxin/SPARK-14994.
    • [SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL · 2057cbcb
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12771 from sameeragarwal/tpcds-2.
    • [SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join · 222dcf79
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
      ```SQL
        SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
        ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
      ```
       Note:
       1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
       2. This rule has to be done after de-duplicating the attributes; otherwise, the generated
          join conditions will be incorrect.
      
      This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like
      ```SQL
        test("except") {
          val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
          val df_right = Seq(1, 3).toDF("id")
      
          checkAnswer(
            df_left.except(df_right),
            Row(2) :: Row(2) :: Row(4) :: Nil
          )
        }
      ```
      After this PR, the result is corrected: we strictly follow the SQL semantics of `EXCEPT DISTINCT`.
      
      #### How was this patch tested?
      Modified and added a few test cases to verify the optimization rule and the results of operators.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12736 from gatorsmile/exceptByAntiJoin.