  1. Apr 16, 2016
    • Andrew Or's avatar
      [SPARK-14672][SQL] Move HiveContext analyze logic to AnalyzeTable · 3394b12c
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Move the implementation of `hiveContext.analyze` to the command of `AnalyzeTable`.
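
      A minimal sketch of the two entry points that now share one code path (a hedged illustration, assuming a `HiveContext` named `hiveContext` and an existing table `src`):

      ```scala
      // Previously the analyze logic lived on HiveContext itself:
      hiveContext.analyze("src")

      // After this change the same logic runs inside the AnalyzeTable command,
      // so the SQL entry point exercises the identical code path:
      hiveContext.sql("ANALYZE TABLE src COMPUTE STATISTICS noscan")
      ```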
      
      ## How was this patch tested?
      Existing tests.
      
      Closes #12429
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12448 from yhuai/analyzeTable.
      3394b12c
    • Andrew Or's avatar
      [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState · 5cefecc9
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.
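
      A rough sketch of the resulting split; the class bodies and field names below are illustrative stand-ins, not the exact ones in the patch:

      ```scala
      import org.apache.spark.SparkContext

      // Hypothetical stand-ins for the real cache manager and catalog types:
      class CacheManager
      class ExternalCatalog

      // State shared by every SQLContext created from the same SparkContext:
      class SharedState(val sparkContext: SparkContext) {
        val cacheManager = new CacheManager        // cached query results
        val externalCatalog = new ExternalCatalog  // persistent table metadata
      }

      // Session-private state, as introduced by SPARK-13526; each new
      // SQLContext gets a fresh copy while pointing at the one SharedState:
      class SessionState(val shared: SharedState) {
        val conf = new java.util.Properties        // session-local settings
      }
      ```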
      
      ## How was this patch tested?
      Existing tests.
      
      Closes #12405
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12447 from yhuai/sharedState.
      5cefecc9
    • 杨博 (Yang Bo)'s avatar
      [SPARK-14683][DOCUMENTATION] Configure external links in ScalaDoc · 3f49afee
      杨博 (Yang Bo) authored
Right now Spark's Scaladoc does not link to the Scala standard library or other dependencies. This can trip up Spark newcomers, who may not be experienced Scala programmers.
      
      This patch fixes these links in ScalaDoc.
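
      For reference, external API links in an sbt build are typically wired up like this; a generic sketch, not the exact change made to Spark's build:

      ```scala
      // In the sbt build definition: derive links automatically for dependencies
      // that publish an apiURL in their POM, and map the Scala standard library
      // explicitly to its hosted Scaladoc.
      autoAPIMappings := true

      apiMappings += (
        scalaInstance.value.libraryJar ->
          url(s"http://www.scala-lang.org/api/${scalaVersion.value}/")
      )
      ```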
      
      Author: 杨博 (Yang Bo) <pop.atry@gmail.com>
      
      Closes #12444 from Atry/patch-1.
      3f49afee
    • Reynold Xin's avatar
      [SPARK-14677][SQL] follow up: make max iter num config internal · 7319fcc1
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a follow-up to make the max iteration number an internal config.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12441 from rxin/maxIterConfInternal.
      7319fcc1
    • Joseph K. Bradley's avatar
      [SPARK-14605][ML][PYTHON] Changed Python to use unicode UIDs for spark.ml Identifiable · 36da5e32
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.
      
      This PR: Use unicode everywhere in Python.
      
      ## How was this patch tested?
      
      Updated persistence unit test to check uid type
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12368 from jkbradley/python-uid-unicode.
      36da5e32
    • hyukjinkwon's avatar
      [MINOR] Remove inappropriate type notation and extra anonymous closure within... · 9f678e97
      hyukjinkwon authored
      [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
      
      ## What changes were proposed in this pull request?
      
      This PR removes
      
      - Inappropriate type notations
          For example, from
          ```scala
          words.foreachRDD { (rdd: RDD[String], time: Time) =>
          ...
          ```
          to
          ```scala
          words.foreachRDD { (rdd, time) =>
          ...
          ```
      
      - Extra anonymous closure within functional transformations.
          For example,
          ```scala
          .map(item => {
            ...
          })
          ```
      
    which can be written more simply as below:
      
          ```scala
          .map { item =>
            ...
          }
          ```
      
      and corrects some obvious style nits.
      
      ## How was this patch tested?
      
This was tested by adding rules to `scalastyle-config.xml`, although the rules did not end up catching every case perfectly.
      
      The rules applied were below:
      
      - For the first correction,
      
      ```xml
      <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
          <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
      </check>
      ```
      
      ```xml
      <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
          <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
      </check>
      ```
      
      - For the second correction
      ```xml
      <check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
          <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
      </check>
      ```
      
      **Those rules were not added**
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12413 from HyukjinKwon/SPARK-style.
      9f678e97
    • Reynold Xin's avatar
      527c780b
    • Wenchen Fan's avatar
      [SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset · 12854464
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.
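
      A hedged usage sketch (the Dataset API was still shifting at this point, so treat names as approximate); `ds` is an assumed `Dataset[Data]` with the usual implicits imported:

      ```scala
      import org.apache.spark.sql.expressions.Aggregator

      case class Data(k: Int, v: Long)

      // A typed aggregator that sums the `v` field:
      val sumV = new Aggregator[Data, Long, Long] {
        def zero: Long = 0L
        def reduce(b: Long, a: Data): Long = b + a.v
        def merge(b1: Long, b2: Long): Long = b1 + b2
        def finish(b: Long): Long = b
      }.toColumn

      // With this patch the typed Aggregator also works on an untyped
      // (relational) groupBy, not just on groupByKey:
      ds.groupBy($"k").agg(sumV)
      ```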
      
      ## How was this patch tested?
      
      new tests in `DatasetAggregatorSuite`
      
      close https://github.com/apache/spark/pull/11269
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12359 from cloud-fan/agg.
      12854464
  2. Apr 15, 2016
    • Reynold Xin's avatar
      [SPARK-14677][SQL] Make the max number of iterations configurable for Catalyst · f4be0946
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization.
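
      For example (key name as introduced by this patch; verify against `SQLConf` if in doubt):

      ```scala
      // Raise the fixed-point iteration limit from the former hard-coded 100:
      sqlContext.setConf("spark.sql.optimizer.maxIterations", "250")
      ```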
      
      ## How was this patch tested?
      Updated unit tests to reflect the change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12434 from rxin/SPARK-14677.
      f4be0946
    • Yin Huai's avatar
      [SPARK-14668][SQL] Move CurrentDatabase to Catalyst · b2dfa849
      Yin Huai authored
      ## What changes were proposed in this pull request?
      
      This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following.
      
      ```
      scala> sqlContext.sql("describe function extended current_database").collect.foreach(println)
      [Function: current_database]
      [Class: org.apache.spark.sql.execution.command.CurrentDatabase]
      [Usage: current_database() - Returns the current database.]
      [Extended Usage:
      > SELECT current_database()]
      ```
      
      ## How was this patch tested?
      Existing tests
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12424 from yhuai/SPARK-14668.
      b2dfa849
    • Sameer Agarwal's avatar
      [SPARK-14620][SQL] Use/benchmark a better hash in VectorizedHashMap · 4df65184
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR uses a better hashing algorithm while probing the AggregateHashMap:
      
      ```java
long h = 0;
      h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
      h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
      h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
      ...
      h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h;
      ```
      
      Depends on: https://github.com/apache/spark/pull/12345
      ## How was this patch tested?
      
          Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
          Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
          Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          codegen = F                              2417 / 2457          8.7         115.2       1.0X
          codegen = T hashmap = F                  1554 / 1581         13.5          74.1       1.6X
          codegen = T hashmap = T                   877 /  929         23.9          41.8       2.8X
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12379 from sameeragarwal/hash.
      4df65184
    • Reynold Xin's avatar
      [SPARK-14628][CORE] Simplify task metrics by always tracking read/write metrics · 8028a288
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      
Part of the reason TaskMetrics and its callers are complicated is the optional metrics we collect, including input, output, shuffle read, and shuffle write. I think we can always track them and just assign 0 as the initial values. It is usually very obvious whether a task is supposed to read any data or not. By always tracking them, we can remove a lot of map, foreach, flatMap, and getOrElse(0L) calls throughout Spark (see the sketch after the list below).
      
      This patch also changes a few behaviors.
      
      1. Removed the distinction of data read/write methods (e.g. Hadoop, Memory, Network, etc).
      2. Accumulate all data reads and writes, rather than only the first method. (Fixes SPARK-5225)
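
      The shape of that simplification, using hypothetical stand-in types rather than the real TaskMetrics API:

      ```scala
      // Hypothetical stand-ins to show the shape of the change:
      case class InputMetrics(bytesRead: Long)
      case class TaskMetricsBefore(inputMetrics: Option[InputMetrics])
      case class TaskMetricsAfter(inputMetrics: InputMetrics)

      // Before: every caller had to unwrap the Option with a default.
      def bytesReadBefore(m: TaskMetricsBefore): Long =
        m.inputMetrics.map(_.bytesRead).getOrElse(0L)

      // After: metrics always exist, initialized to 0, so callers read directly.
      def bytesReadAfter(m: TaskMetricsAfter): Long = m.inputMetrics.bytesRead
      ```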
      
      ## How was this patch tested?
      
      existing tests.
      
This is based on https://github.com/apache/spark/pull/12388, with more test fixes.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12417 from cloud-fan/metrics-refactor.
      8028a288
    • Xusen Yin's avatar
      [SPARK-7861][ML] PySpark OneVsRest · 90b46e01
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-7861
      
Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.
      
      ## How was this patch tested?
      
      Test with doctest.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #12124 from yinxusen/SPARK-14306-7861.
      90b46e01
    • sethah's avatar
      [SPARK-14104][PYSPARK][ML] All Python param setters should use the `_set` method · 129f2f45
      sethah authored
      ## What changes were proposed in this pull request?
      
Param setters in Python previously accessed `_paramMap` directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method, ensuring type checking happens.
      
      Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py, so those are added here
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR; this is fixed here.
      
      ## How was this patch tested?
      
      Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11939 from sethah/SPARK-14104.
      129f2f45
    • Joseph K. Bradley's avatar
      [SPARK-14665][ML][PYTHON] Fixed bug with StopWordsRemover default stopwords · d6ae7d46
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      The default stopwords were a Java object.  They are no longer.
      
      ## How was this patch tested?
      
      Unit test which failed before the fix
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12422 from jkbradley/pyspark-stopwords.
      d6ae7d46
    • Yanbo Liang's avatar
      [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for... · 83af297a
      Yanbo Liang authored
      [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions
      
      ## What changes were proposed in this pull request?
      Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; we only provide the most commonly used statistics in this PR. More statistics can be added in follow-up work.
      
      ## How was this patch tested?
      Unit tests.
      
      SparkR Output:
      ```
      Deviance Residuals:
      (Note: These are approximate quantiles with relative error <= 0.01)
           Min        1Q    Median        3Q       Max
      -0.95096  -0.16585  -0.00232   0.17410   0.72918
      
      Coefficients:
                          Estimate  Std. Error  t value  Pr(>|t|)
      (Intercept)         1.6765    0.23536     7.1231   4.4561e-11
      Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
      Species_versicolor  -0.98339  0.072075    -13.644  0
      Species_virginica   -1.0075   0.093306    -10.798  0
      
      (Dispersion parameter for gaussian family taken to be 0.08351462)
      
          Null deviance: 28.307  on 149  degrees of freedom
      Residual deviance: 12.193  on 146  degrees of freedom
      AIC: 59.22
      
      Number of Fisher Scoring iterations: 1
      ```
      R output:
      ```
      Deviance Residuals:
           Min        1Q    Median        3Q       Max
      -0.95096  -0.16522   0.00171   0.18416   0.72918
      
      Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
      (Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
      Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
      Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
      Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      (Dispersion parameter for gaussian family taken to be 0.08351462)
      
          Null deviance: 28.307  on 149  degrees of freedom
      Residual deviance: 12.193  on 146  degrees of freedom
      AIC: 59.217
      
      Number of Fisher Scoring iterations: 2
      ```
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12393 from yanboliang/spark-13925.
      83af297a
    • Peter Ableda's avatar
      [SPARK-14633] Use more readable format to show memory bytes in Error Message · 06b9d623
      Peter Ableda authored
      ## What changes were proposed in this pull request?
      
Round the memory bytes value and convert it back to Long, its original type. This change fixes the formatting issue in the exception message.
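
      A minimal sketch of the kind of formatting involved (Spark has its own byte-formatting helper; this standalone version is just illustrative):

      ```scala
      def bytesToReadable(bytes: Long): String = {
        val KB = 1L << 10
        val MB = 1L << 20
        val GB = 1L << 30
        if (bytes >= GB) f"${bytes.toDouble / GB}%.1f GB"
        else if (bytes >= MB) f"${bytes.toDouble / MB}%.1f MB"
        else if (bytes >= KB) f"${bytes.toDouble / KB}%.1f KB"
        else s"$bytes B"
      }

      // e.g. bytesToReadable(500135000) returns "477.0 MB" instead of a raw byte count
      ```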
      
      ## How was this patch tested?
      
      Manual tests were done in CDH cluster.
      
      Author: Peter Ableda <peter.ableda@cloudera.com>
      
      Closes #12392 from peterableda/SPARK-14633.
      06b9d623
    • Pravin Gadakh's avatar
      [SPARK-14370][MLLIB] removed duplicate generation of ids in OnlineLDAOptimizer · e2492326
      Pravin Gadakh authored
      ## What changes were proposed in this pull request?
      
      Removed duplicated generation of `ids` in OnlineLDAOptimizer.
      
      ## How was this patch tested?
      
      tested with existing unit tests.
      
      Author: Pravin Gadakh <prgadakh@in.ibm.com>
      
      Closes #12176 from pravingadakh/SPARK-14370.
      e2492326
    • DB Tsai's avatar
      [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local · 96534aa4
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
This task copies the Vector and Matrix classes from mllib to the ml package in the mllib-local jar. The UDTs and `Since` annotations in the ml vector and matrix classes are removed for now: UDTs will be handled by SPARK-14487, and `Since` will be replaced by `/* since 1.2.0 */` Java doc comments.

The BLAS implementation will be copied, and some of the test utilities will be copied as well.
      
      Summary of changes:
      
      1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
        - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
      2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
  - `Since` was removed; we'll use standard `/* Since ... */` Java doc comments, in another PR.
        - `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259  to replace the annotation.
      3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
        - `Since` was removed.
        - `UDT` related code was removed.
        - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
      4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
        - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
      5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
        - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12317 from dbtsai/dbtsai-ml-vector.
      96534aa4
    • Reynold Xin's avatar
      Closes #12407 · a9324a06
      Reynold Xin authored
      Closes #12408
      Closes #12401
      a9324a06
  3. Apr 14, 2016
    • Yanbo Liang's avatar
      [SPARK-14374][ML][PYSPARK] PySpark ml GBTClassifier, Regressor support export/import · b9613239
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      PySpark ml GBTClassifier, Regressor support export/import.
      
      ## How was this patch tested?
      Doc test.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12383 from yanboliang/spark-14374.
      b9613239
    • Wenchen Fan's avatar
      [SPARK-14275][SQL] Reimplement TypedAggregateExpression to DeclarativeAggregate · 297ba3f1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `ExpressionEncoder` is just a container for serialization and deserialization expressions, we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient.
      
One trick: each buffer serializer expression references the result object of the serialization function call. To avoid re-calculating this result object, we serialize the buffer object to a single struct field, so that a special `Expression` can evaluate the result object only once.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12067 from cloud-fan/typed_udaf.
      297ba3f1
    • Sameer Agarwal's avatar
      [SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using VectorizedHashMap · b5c60bcd
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).
      
      Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distribution of keys) and requires us to fall back on the latter for correctness.
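
      An illustrative sketch of the probing scheme described above (not the generated code): a power-of-2 capacity lets a bit-mask replace the modulo, and probing gives up after a bounded number of tries instead of growing:

      ```scala
      class LinearProbingIndex(capacityLog2: Int, maxTries: Int = 8) {
        private val capacity = 1 << capacityLog2
        private val mask = capacity - 1
        private val keys = new Array[Long](capacity)
        private val used = new Array[Boolean](capacity)

        /** Returns the slot holding `key`, or -1 to signal the fallback path. */
        def lookupOrInsert(key: Long): Int = {
          var idx = (hash(key) & mask).toInt
          var tries = 0
          while (tries < maxTries) {
            if (!used(idx)) { used(idx) = true; keys(idx) = key; return idx }
            if (keys(idx) == key) return idx
            idx = (idx + 1) & mask  // linear probe to the next slot
            tries += 1
          }
          -1  // caller falls back to the slower but more robust BytesToBytesMap
        }

        private def hash(k: Long): Long = (k ^ 0x9e3779b9L) + (k << 6) + (k >>> 2)
      }
      ```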
      
      ## How was this patch tested?
      
          Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
          Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
          Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          -------------------------------------------------------------------------------------------
          codegen = F                              2124 / 2204          9.9         101.3       1.0X
          codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
          codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12345 from sameeragarwal/tungsten-aggregate-integration.
      b5c60bcd
    • Mark Grover's avatar
      [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly · ff9ae61a
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Removing references to assembly jar in documentation.
      Adding an additional (previously undocumented) usage of spark-submit to run examples.
      
      ## How was this patch tested?
      
      Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #12365 from markgrover/spark-14601.
      ff9ae61a
    • Fokko Driesprong's avatar
      [SPARK-12869] Implemented an improved version of the toIndexedRowMatrix · c80586d9
      Fokko Driesprong authored
      Hi guys,
      
      I've implemented an improved version of the `toIndexedRowMatrix` function on the `BlockMatrix`. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times:
      https://github.com/Fokko/BlockMatrixToIndexedRowMatrix
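
      A quick usage sketch of the conversion being improved (assumes a live SparkContext `sc`):

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

      val rows = sc.parallelize(Seq(
        IndexedRow(0L, Vectors.dense(1.0, 2.0)),
        IndexedRow(1L, Vectors.dense(3.0, 4.0))))

      // Round-trip through the block representation; the second call is the
      // conversion this PR speeds up for dense matrices:
      val blocks = new IndexedRowMatrix(rows).toBlockMatrix()
      val indexed = blocks.toIndexedRowMatrix()
      ```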
      
      If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.
      
      Author: Fokko Driesprong <f.driesprong@catawiki.nl>
      Author: Fokko Driesprong <fokko@driesprongen.nl>
      
      Closes #10839 from Fokko/master.
      c80586d9
    • Yong Tang's avatar
      [SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature... · 01dd1f5c
      Yong Tang authored
      [SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature subset size instead of regexes
      
      ## What changes were proposed in this pull request?
      
This fix changes how RandomForest parses its supported feature subset size strategies, replacing regexes with parseInt and parseDouble for robustness and maintainability.
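
      Roughly, the regex-free parsing looks like this (an illustrative sketch, not the patch's exact code):

      ```scala
      sealed trait SubsetSize
      case class Named(strategy: String) extends SubsetSize   // e.g. "sqrt", "log2"
      case class Count(n: Int) extends SubsetSize             // e.g. "3" features
      case class Fraction(f: Double) extends SubsetSize       // e.g. "0.33" of features

      def parseSubsetSize(s: String): SubsetSize = s.toLowerCase match {
        case n @ ("auto" | "all" | "sqrt" | "log2" | "onethird") => Named(n)
        case other =>
          try Count(java.lang.Integer.parseInt(other))
          catch { case _: NumberFormatException =>
            Fraction(java.lang.Double.parseDouble(other))
          }
      }
      ```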
      
      ## How was this patch tested?
      
      Existing tests passed.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12360 from yongtang/SPARK-14565.
      01dd1f5c
    • Dongjoon Hyun's avatar
      [SPARK-14545][SQL] Improve `LikeSimplification` by adding `a%b` rule · d7e124ed
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Current `LikeSimplification` handles the following four rules.
      - 'a%' => expr.StartsWith("a")
      - '%b' => expr.EndsWith("b")
      - '%a%' => expr.Contains("a")
      - 'a' => EqualTo("a")
      
      This PR adds the following rule.
      - 'a%b' => expr.Length() >= 2 && expr.StartsWith("a") && expr.EndsWith("b")
      
      Here, 2 is statically calculated from "a".size + "b".size.
      
      **Before**
      ```
      scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Filter a#5 LIKE a%c
      :     +- INPUT
      +- Generate explode([abc,adc]), false, false, [a#5]
         +- Scan OneRowRelation[]
      ```
      
      **After**
      ```
      scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Filter ((length(a#5) >= 2) && (StartsWith(a#5, a) && EndsWith(a#5, c)))
      :     +- INPUT
      +- Generate explode([abc,adc]), false, false, [a#5]
         +- Scan OneRowRelation[]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12312 from dongjoon-hyun/SPARK-14545.
      d7e124ed
    • Yong Tang's avatar
      [SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib · bc748b7b
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
This fix adds a binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, all non-zero counts are set to 1.

Note: This fix (SPARK-14238) extends SPARK-13963, where the Scala implementation was done.
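
      The Scala side of the same toggle (from SPARK-13963) looks like this, shown for reference since the Python API mirrors it:

      ```scala
      import org.apache.spark.ml.feature.HashingTF

      // With setBinary(true), every non-zero term count is clamped to 1.0,
      // which suits models expecting binary features (e.g. Bernoulli naive Bayes):
      val tf = new HashingTF()
        .setInputCol("words")
        .setOutputCol("features")
        .setBinary(true)
      ```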
      
      ## How was this patch tested?
      
      This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12079 from yongtang/SPARK-14238.
      bc748b7b
    • Joseph K. Bradley's avatar
      [SPARK-14618][ML][DOC] Updated RegressionEvaluator.metricName param doc · bf65c87f
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs.
      
      ## How was this patch tested?
      
      no tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12377 from jkbradley/regeval-doc.
      bf65c87f
    • Bryan Cutler's avatar
      [SPARK-13967][PYSPARK][ML] Added binary Param to Python CountVectorizer · c5172f82
      Bryan Cutler authored
      Added binary toggle param to CountVectorizer feature transformer in PySpark.
      
      Created a unit test for using CountVectorizer with the binary toggle on.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
      c5172f82
    • Liang-Chi Hsieh's avatar
      [SPARK-14592][SQL] Native support for CREATE TABLE LIKE DDL command · 28efdd3f
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      JIRA: https://issues.apache.org/jira/browse/SPARK-14592
      
      This patch adds native support for DDL command `CREATE TABLE LIKE`.
      
      The SQL syntax is like:
      
          CREATE TABLE table_name LIKE existing_table
          CREATE TABLE IF NOT EXISTS table_name LIKE existing_table
      
      ## How was this patch tested?
      `HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Andrew Or <andrew@databricks.com>
      
      Closes #12362 from viirya/create-table-like.
      28efdd3f
    • gatorsmile's avatar
      [SPARK-14499][SQL][TEST] Drop Partition Does Not Delete Data of External Tables · c971aee4
      gatorsmile authored
      #### What changes were proposed in this pull request?
This PR adds a test to ensure that dropping partitions of an external table does not delete the underlying data.
      
      cc yhuai andrewor14
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Andrew Or <andrew@databricks.com>
      
      Closes #12350 from gatorsmile/testDropPartition.
      c971aee4
    • Wenchen Fan's avatar
      [SPARK-14558][CORE] In ClosureCleaner, clean the outer pointer if it's a REPL line object · 1d04c86f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
When we clean a closure, if its outermost parent is not a closure, we won't clone and clean it, as cloning user's objects is dangerous. However, if it's a REPL line object, which may carry a lot of unnecessary references (like Hadoop conf, Spark conf, etc.), we should clean it as it's not a user object.
      
      This PR improves the check for user's objects to exclude REPL line object.
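
      A REPL illustration of the capture chain in question (a sketch; `sc` is the shell's SparkContext):

      ```
      scala> val bigUnrelated = new Array[Byte](64 << 20)  // lives in this line's wrapper object
      scala> val n = 1
      scala> sc.parallelize(1 to 10).map(_ + n).count()
      ```

      The closure `_ + n` reaches `n` through the chain of REPL line objects, and without cleaning the outer pointer it could drag `bigUnrelated` along when the task is serialized.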
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12327 from cloud-fan/closure.
      1d04c86f
    • Reynold Xin's avatar
      [SPARK-14617] Remove deprecated APIs in TaskMetrics · a46f98d3
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes some of the deprecated APIs in TaskMetrics. This is part of my bigger effort to simplify accumulators and task metrics.
      
      ## How was this patch tested?
      N/A - only removals
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12375 from rxin/SPARK-14617.
      a46f98d3
    • Reynold Xin's avatar
      [SPARK-14619] Track internal accumulators (metrics) by stage attempt · dac40b68
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.
      
      ## How was this patch tested?
      Covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12378 from rxin/SPARK-14619.
      dac40b68
    • Sean Owen's avatar
      [SPARK-14612][ML] Consolidate the version of dependencies in mllib and mllib-local into one place · 9fa43a33
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Move json4s, breeze dependency declaration into parent
      
      ## How was this patch tested?
      
      Should be no functional change, but Jenkins tests will test that.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12390 from srowen/SPARK-14612.
      9fa43a33
    • Liwei Lin's avatar
      [SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods... · 3e27940a
      Liwei Lin authored
      [SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods should have explicit return types
      
      ## What changes were proposed in this pull request?
      
      Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
      ```scala
      def start() // should be: def start(): Unit
      def stop()  // should be: def stop(): Unit
      ```
      
      These methods exist in core, sql, streaming; this PR fixes them.
      
      ## How was this patch tested?
      
      N/A
      
## Which Scala style rule led to the changes?
      
      the rule was added separately in https://github.com/apache/spark/pull/12396
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12389 from lw-lin/public-abstract-methods.
      3e27940a
    • Reynold Xin's avatar
      [SPARK-14625] TaskUIData and ExecutorUIData shouldn't be case classes · de2ad528
      Reynold Xin authored
      ## What changes were proposed in this pull request?
I was trying to understand the accumulator and metrics update source code, and these two classes don't really need to be case classes. It would also be more consistent with other UI classes if they are not case classes. This is part of my bigger effort to simplify accumulators and task metrics.
      
      ## How was this patch tested?
      This is a straightforward refactoring without behavior change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12386 from rxin/SPARK-14625.
      de2ad528
    • gatorsmile's avatar
      [SPARK-14125][SQL] Native DDL Support: Alter View · 0d22092c
      gatorsmile authored
      #### What changes were proposed in this pull request?
This PR provides native DDL support for the following three ALTER VIEW commands, based on the Hive DDL document:
      https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
      ##### 1. ALTER VIEW RENAME
      **Syntax:**
      ```SQL
      ALTER VIEW view_name RENAME TO new_view_name
      ```
      - to change the name of a view to a different name
      - not allowed to rename a view's name by ALTER TABLE
      
      ##### 2. ALTER VIEW SET TBLPROPERTIES
      **Syntax:**
      ```SQL
      ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
      ```
      - to add metadata to a view
      - not allowed to set views' properties by ALTER TABLE
      - ignore it if trying to set a view's existing property key when the value is the same
      - overwrite the value if trying to set a view's existing key to a different value
      
      ##### 3. ALTER VIEW UNSET TBLPROPERTIES
      **Syntax:**
      ```SQL
      ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
      ```
      - to remove metadata from a view
      - not allowed to unset views' properties by ALTER TABLE
      - issue an exception if trying to unset a view's non-existent key
      
      #### How was this patch tested?
      Added test cases to verify if it works properly.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12324 from gatorsmile/alterView.
      0d22092c
    • Dhruve Ashar's avatar
      [SPARK-14572][DOC] Update config docs to allow -Xms in extraJavaOptions · f83ba454
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
The configuration docs are updated to reflect the changes introduced with [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This allows the user to specify initial heap memory settings through extraJavaOptions for the executor, driver, and AM.
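
      A hedged sketch of the now-documented usage (max heap is still controlled by the memory settings, not by -Xmx in extraJavaOptions):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.executor.memory", "4g")               // sets -Xmx for executors, as before
        .set("spark.executor.extraJavaOptions", "-Xms4g") // initial heap, now permitted
      // (driver/AM equivalents: spark.driver.extraJavaOptions, spark.yarn.am.extraJavaOptions)
      ```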
      
      ## How was this patch tested?
      The changes are tested in [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This is just documenting the changes made.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #12333 from dhruve/doc/SPARK-14572.
      f83ba454