  4. Dec 12, 2016
    • [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots · 1aeb7f42
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
Support overriding the download URL (including the version directory) via an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`.
      
      ## How was this patch tested?
      
unit tests, and manual testing of:
      - snapshot build url
        - download when spark jar not cached
        - when spark jar is cached
      - RC build url
        - download when spark jar not cached
        - when spark jar is cached
      - multiple cached spark versions
      - starting with sparkR shell
      
To use this, set the environment variable before starting R:
```
export SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz
R
```
      then in R,
      ```
      library(SparkR) # or specify lib.loc
      sparkR.session()
      ```
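
As a quick sanity check (a hypothetical snippet, not part of the PR), you can confirm the override is visible to the R process before the session starts:
```r
Sys.setenv(SPARKR_RELEASE_DOWNLOAD_URL = "http://this_is_the_url_to_spark_release_tgz")
Sys.getenv("SPARKR_RELEASE_DOWNLOAD_URL")  # echoes the URL the SparkR install step is expected to use
```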
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16248 from felixcheung/rinstallurl.
      
      (cherry picked from commit 8a51cfdc)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  5. Dec 09, 2016
    • [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values · 8bf56cc4
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
Several SparkR APIs that call into JVM methods with void return values get the NULL return value printed, especially when running in a REPL or IDE.
Example:
      ```
      > setLogLevel("WARN")
      NULL
      ```
We should suppress this output to make the result clearer.
      
Also found a small change to the return value of dropTempView in 2.1; adding a doc and test for it.
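
The usual R idiom for this is to wrap side-effect-only returns in `invisible()`; a minimal sketch of the pattern (illustrative, not the actual SparkR code):
```r
# Hypothetical sketch: invisible(NULL) keeps the REPL from auto-printing
# the result of a call that is executed only for its side effect.
setLogLevelSketch <- function(level) {
  message("log level set to ", level)  # stands in for the JVM call
  invisible(NULL)
}
setLogLevelSketch("WARN")  # no stray "NULL" is printed at the prompt
```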
      
      ## How was this patch tested?
      
Manually; I didn't find an expect_*() method in testthat for this.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16237 from felixcheung/rinvis.
      
      (cherry picked from commit 3e11d5bf)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-18349][SPARKR] Update R API documentation on ml model summary · 4ceed95b
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to follow the format:
      
      returns summary information of the fitted model, which is a list. The list includes .......
      
Since `summary` in R is mainly about the model, and is not the same as the `summary` object on the Scala side (if there is one), the Scala API doc is not linked here.
      
In the current documentation, some `return` entries end with `.` and some don't; `.` is added to the ones missing it.
      
Since the `spark.logit` `summary` is undergoing a big refactoring, this PR doesn't include it; it will be changed when the `spark.logit` PR is merged.
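
For example, the documented format applies to output like the following hypothetical call (model and columns are illustrative):
```r
df <- suppressWarnings(createDataFrame(iris))
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
s <- summary(model)  # returns summary information of the fitted model, which is a list
names(s)             # the list includes, among others, the coefficients
```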
      
      ## How was this patch tested?
      
      Manual build.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16150 from wangmiao1981/audit2.
      
      (cherry picked from commit 86a96034)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
  6. Dec 08, 2016
    • [SPARK-18590][SPARKR] build R source package when making distribution · d69df907
      Felix Cheung authored
This PR has two key changes. One, we build a source package (aka bundle package) for SparkR that could be released on CRAN. Two, the official Spark binary distributions should include SparkR installed from this source package instead (which has the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).
      
      But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
      
      This PR also includes a few minor fixes.
      
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh.
1. The package needs to be installed because the first code block in the vignettes is `library(SparkR)` without a lib path.
2. `R CMD build` will build the vignettes (this process runs Spark/SparkR code and captures outputs into the pdf documentation).
3. `R CMD check` on the source package will install the package and build the vignettes again (this time from the packaged source) - this is a key step required to release an R package on CRAN (tests are skipped here, but tests will need to pass for the CRAN release process to succeed; ideally, during release sign-off we should install from the R source package and run the tests).
4. `R CMD INSTALL` on the source package - this is the only way to generate the doc/vignettes rds files correctly (not step 1); the output of this step is what we package into the Spark dist and sparkr.zip.
      
Alternatively, `R CMD build` should already be installing the package in a temp directory, so we might just find that location and set it as the lib.loc parameter; another approach is perhaps to call `R CMD INSTALL --build pkg` instead. In any case, despite installing the package multiple times, this is relatively fast. Building the vignettes takes a while though.
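
A rough sketch of that sequence, driven from R via `system2()` (paths and flags are placeholders, not the exact make-distribution.sh logic):
```r
pkg_dir <- "R/pkg"                                      # assumed location of the SparkR package sources
system2("R", c("CMD", "build", pkg_dir))                # step 2: also builds the vignettes
tarball <- Sys.glob("SparkR_*.tar.gz")[1]
system2("R", c("CMD", "check", "--no-tests", tarball))  # step 3: CRAN-style check, tests skipped
system2("R", c("CMD", "INSTALL", tarball))              # step 4: generates the doc/vignettes rds files
```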
      
      Manually, CI.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16014 from felixcheung/rdist.
      
      (cherry picked from commit c3d3a9d0)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • Patrick Wendell · 48aa6775
    • Preparing Spark release v2.1.0-rc2 · 08071749
      Patrick Wendell authored
  7. Dec 07, 2016
    • [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1 · 1c3f1da8
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```, since it is used when making predictions and should be an argument of ```predict```; we will work on this in [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle. (A usage sketch follows this list.)
* Fix ```spark.als``` params to make them consistent with MLlib.
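
A hypothetical usage sketch after the change (dataset and parameters are illustrative): the fit call no longer takes ```probabilityCol```, and prediction goes through ```predict```:
```r
df <- suppressWarnings(createDataFrame(iris))
model <- spark.logit(df, Species ~ ., regParam = 0.5)  # no probabilityCol argument here
head(predict(model, df))                               # probability-related options are meant to move to predict
```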
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16169 from yanboliang/spark-18326.
      
      (cherry picked from commit 97255497)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    • [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils · 51754d6d
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high: k/(l-1) after the l-th element instead of k/l, which matters for small k.
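
For reference, a minimal reservoir-sampling sketch in R (not Spark's SamplingUtils) showing the corrected behavior - once the l-th element arrives (l > k), it replaces a reservoir slot with probability k/l:
```r
reservoir_sample <- function(x, k) {
  res <- x[seq_len(k)]               # the first k elements fill the reservoir
  if (length(x) > k) {
    for (l in seq(k + 1, length(x))) {
      j <- sample.int(l, 1)          # uniform over 1..l, so P(j <= k) = k/l
      if (j <= k) res[j] <- x[l]     # replace with probability k/l, not k/(l-1)
    }
  }
  res
}
reservoir_sample(1:100, 5)
```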
      
      ## How was this patch tested?
      
      Existing test plus new test case.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16129 from srowen/SPARK-18678.
      
      (cherry picked from commit 79f5f281)
Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. · 340e9aea
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
Several cleanups and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0), which will change in a later Spark version. Since that would introduce breaking changes, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove the argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.
      
      ## How was this patch tested?
      Unit tests.
      
      The ```summary``` output after this change:
      multinomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > model <- spark.logit(df, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   versicolor  virginica   setosa
      (Intercept)  1.514031    -2.609108   1.095077
      Sepal_Length 0.02511006  0.2649821   -0.2900921
      Sepal_Width  -0.5291215  -0.02016446 0.549286
      Petal_Length 0.03647411  0.1544119   -0.190886
      Petal_Width  0.000236092 0.4195804   -0.4198165
      ```
      binomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > training <- df[df$Species %in% c("versicolor", "virginica"), ]
      > model <- spark.logit(training, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   Estimate
      (Intercept)  -6.053815
      Sepal_Length 0.2449379
      Sepal_Width  0.1648321
      Petal_Length 0.4730718
      Petal_Width  1.031947
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16117 from yanboliang/spark-18686.
      
      (cherry picked from commit 90b59d1b)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  8. Dec 04, 2016
    • [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark · c13c2939
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
If SparkR is running as a package and has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. Basically, with this bug the auto-installed Spark only works in the first session.
      
This seems to be a regression from the earlier behavior.
      
The fix is to always try to install or check for the cached Spark if running in an interactive session.
As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio, etc.).
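
A minimal sketch of that guard (assumed logic, not the actual SparkR source):
```r
# Hypothetical sketch: look up (or download) the cached Spark only when no
# SPARK_HOME is set and the session is interactive (R shell, RStudio, etc.).
if (nchar(Sys.getenv("SPARK_HOME")) == 0 && interactive()) {
  spark_home <- SparkR::install.spark()  # downloads on first use, reuses the cache afterwards
}
```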
      
      ## How was this patch tested?
      
      Manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16077 from felixcheung/rsessioninteractive.
      
      (cherry picked from commit b019b3a8)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  12. Nov 23, 2016
    • [SPARK-18510] Fix data corruption from inferred partition column dataTypes · 15d2cf26
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      ### The Issue
      
      If I specify my schema when doing
      ```scala
      spark.read
        .schema(someSchemaWherePartitionColumnsAreStrings)
      ```
but partition inference can infer it as IntegerType (or, I assume, LongType or DoubleType - basically fixed-size types), then once UnsafeRows are generated, your data will be corrupted.
      
      ### Proposed solution
      
      The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
      
The real issue is that a user who uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides and fall back to our inferred data types. What happens in the end is data corruption.
      
      My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
      
      We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
      
A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
      
      ## How was this patch tested?
      
      Regression tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15951 from brkyvz/partition-corruption.
      
      (cherry picked from commit 0d1bf2b6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    • [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site · 5f198d20
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
Updates links to the wiki so that they point to the new location of the content on spark.apache.org.
      
      ## How was this patch tested?
      
      Doc builds
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15967 from srowen/SPARK-18073.1.
      
      (cherry picked from commit 7e0cd1d9)
Signed-off-by: Sean Owen <sowen@cloudera.com>
  13. Nov 22, 2016
    • [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data · fc5fee83
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since the ```standard error of coefficients, t value and p value``` are not available in this condition. (A reproduction sketch follows this list.)
* Scala/Python GLM summary should throw an exception if users ask for the ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
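
A hypothetical reproduction sketch for the collinear case above (values are illustrative):
```r
x1 <- 1:10
local_df <- data.frame(y = rnorm(10), x1 = x1, x2 = 2 * x1)  # x2 is perfectly collinear with x1
df <- createDataFrame(local_df)
model <- spark.glm(df, y ~ x1 + x2)
summary(model)  # after the fix: coefficients without standard error, t value and p value
```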
      
      ## How was this patch tested?
      Add unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15930 from yanboliang/spark-18501.
      
      (cherry picked from commit 982b82e3)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    • [SPARK-18514][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across R API documentation · 63aa01ff
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems in R, there are
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      
      This PR proposes to fix those to `Note:` to be consistent.
      
      **Before**
      
      ![2016-11-21 11 30 07](https://cloud.githubusercontent.com/assets/6477701/20468848/2f27b0fa-afde-11e6-89e3-993701269dbe.png)
      
      **After**
      
![2016-11-21 11 29 44](https://cloud.githubusercontent.com/assets/6477701/20468851/39469664-afde-11e6-9929-ad80be7fc405.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " .
      grep -r "Note that " .
      ```
      
And then fixed one by one, comparing with the API documentation.
      
      After that, manually tested via `sh create-docs.sh` under `./R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15952 from HyukjinKwon/SPARK-18514.
      
      (cherry picked from commit 4922f9cd)
Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package. · c7021407
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, it downloads the Spark package from the Apache website, which is not necessary.
      ```
      ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
      ```
      The following is output:
      ```
      Attaching package: ‘SparkR’
      
      The following objects are masked from ‘package:stats’:
      
          cov, filter, lag, na.omit, predict, sd, var, window
      
      The following objects are masked from ‘package:base’:
      
          as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
          rank, rbind, sample, startsWith, subset, summary, transform, union
      
      Spark not found in SPARK_HOME:
      Spark not found in the cache directory. Installation will start.
      MirrorUrl not provided.
      Looking for preferred site from apache website...
      ......
      ```
There's no ```SPARK_HOME``` in yarn-cluster mode, since the R process runs on a remote host of the YARN cluster rather than on the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark, as Spark is already running.
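
A sketch of the intended guard (the environment variable name is an assumption about how the backend signals an already-running JVM, not necessarily the variable SparkR uses):
```r
backend_port <- Sys.getenv("EXISTING_SPARKR_BACKEND_PORT", "")  # assumed to be set in yarn-cluster mode
if (backend_port != "") {
  # The backend JVM is already running: connect to it and never download Spark.
} else if (interactive()) {
  # Client-side interactive session: fall back to the cached or downloaded Spark.
}
```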
      
      ## How was this patch tested?
      Offline test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15888 from yanboliang/spark-18444.
      
      (cherry picked from commit acb97157)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  16. Nov 13, 2016
    • [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data · 8fc6455c
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest``` (classification), ```spark.gbt``` (classification), ```spark.naiveBayes``` and ```spark.glm``` (binomial family) are fitted on libsvm data. (A reproduction sketch follows this list.)
      ```
      java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
      ```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
* Drop some unwanted columns when making predictions.
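
An illustrative reproduction sketch (the file path is the Spark example dataset, used here only as an example):
```r
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
model <- spark.naiveBayes(df, label ~ features)  # previously failed with the forceIndexLabel error above
head(predict(model, df))
```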
      
      ## How was this patch tested?
      Add unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15851 from yanboliang/spark-18412.
      
      (cherry picked from commit 07be232e)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  19. Nov 08, 2016
    • [SPARK-18239][SPARKR] Gradient Boosted Tree for R · 98dd7ac7
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Gradient Boosted Tree in R.
      With a few minor improvements to RandomForest in R.
      
Since this is relatively isolated, I'd like to target this for branch-2.1.
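
A minimal usage sketch of the new wrapper (dataset and parameters are illustrative, not from the PR):
```r
df <- suppressWarnings(createDataFrame(iris))
model <- spark.gbt(df, Sepal_Length ~ ., type = "regression", maxIter = 10)
summary(model)
head(predict(model, df))
```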
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15746 from felixcheung/rgbt.
      
      (cherry picked from commit 55964c15)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
  20. Nov 07, 2016
    • [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial. · 6b332909
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      SparkR ```spark.glm``` predict should output original label when family = "binomial".
      
      ## How was this patch tested?
      Add unit test.
      You can also run the following code to test:
      ```R
      training <- suppressWarnings(createDataFrame(iris))
      training <- training[training$Species %in% c("versicolor", "virginica"), ]
model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = binomial(link = "logit"))
      showDF(predict(model, training))
      ```
      Before this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      ```
      After this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+----------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
      +------------+-----------+------------+-----------+----------+-----+----------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|versicolor|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|versicolor|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0|versicolor|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| virginica|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|versicolor|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|versicolor|
      +------------+-----------+------------+-----------+----------+-----+----------+
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15788 from yanboliang/spark-18291.
      
      (cherry picked from commit daa975f4)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  23. Nov 02, 2016
    • [SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table · 5ea2f9e5
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
Due to a limitation of the Hive metastore (a table location must be a directory path, not a file path), we always store the `path` for a data source table in storage properties, instead of in the `locationUri` field. However, we should not expose this difference at the `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we do when storing the table schema of a data source table in table properties.
      
This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`: both data source tables and Hive serde tables should use the `locationUri` field.
      
This PR also unifies the way we handle the default table location for managed tables. Previously, the default table location of a Hive serde managed table was set by the external catalog, but the one for a data source table was set by the command. After this PR, we follow the Hive way and the default table location is always set by the external catalog.
      
For managed non-file-based tables, we will assign a default table location and create an empty directory for it; the table location will be removed when the table is dropped. This is reasonable, as the metastore doesn't care whether a table is file-based or not, and an empty table directory does no harm.
For external non-file-based tables, ideally we could omit the table location, but due to a Hive metastore issue we will assign a random location to it and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.
      
To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.

## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15024 from cloud-fan/path.
      
      (cherry picked from commit 3a1bc6f4)
Signed-off-by: Yin Huai <yhuai@databricks.com>
    • [SPARK-16839][SQL] Simplify Struct creation code path · 41491e54
      eyal farago authored
      
      ## What changes were proposed in this pull request?
      
      Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
      
      This PR includes:
      
      1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
      2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
      3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
      4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
      5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
      
      ## How was this patch tested?
Ran all test suites in package org.apache.spark.sql, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, reran the entire analysis package successfully.
      
Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.
      
      Author: eyal farago <eyal farago>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      Author: eyal farago <eyal.farago@gmail.com>
      Author: Eyal Farago <eyal.farago@actimize.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      Author: eyalfa <eyal.farago@gmail.com>
      
      Closes #15718 from hvanhovell/SPARK-16839-2.
      
      (cherry picked from commit f151bd1a)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>