  1. Dec 14, 2016
    • [SPARK-18849][ML][SPARKR][DOC] vignettes final check update · 2a8de2e1
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc cleanup
      
      ## How was this patch tested?
      
      ~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~
      Output html here: https://felixcheung.github.io/sparkr-vignettes.html
      
      
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16286 from felixcheung/rvignettespass.
      
      (cherry picked from commit 7d858bc5)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-18875][SPARKR][DOCS] Fix R API doc generation by adding `DESCRIPTION` file · d399a297
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since Apache Spark 1.4.0, the R API documentation page has had a broken link to the `DESCRIPTION` file because the Jekyll plugin script doesn't copy the file. This PR fixes that.
      
      - Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html
      - Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html
      
      
      
      ## How was this patch tested?
      
      Manual.
      
      ```bash
      cd docs
      SKIP_SCALADOC=1 jekyll build
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16292 from dongjoon-hyun/SPARK-18875.
      
      (cherry picked from commit ec0eae48)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-18869][SQL] Add TreeNode.p that returns BaseType · b14fc391
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. For interactive debugging, it is easier to have a function that returns the BaseType.
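
      To illustrate the convenience, here is a minimal, self-contained sketch with toy classes (assumed names, not Spark's actual `TreeNode` code):

      ```scala
      // Hedged sketch: apply(i) loses the specific node type, a p(i) helper restores it.
      abstract class Node[BaseType <: Node[BaseType]] { self: BaseType =>
        def children: Seq[BaseType]

        // Depth-first numbering shared by lookup and printing.
        def numbered: Seq[BaseType] = self +: children.flatMap(_.numbered)

        // Like TreeNode.apply after SPARK-18854: returns the erased supertype.
        def apply(number: Int): Node[_] = numbered(number)

        // Typed variant for interactive debugging, like the proposed TreeNode.p.
        def p(number: Int): BaseType = numbered(number)
      }

      case class Plan(name: String, children: Seq[Plan] = Nil) extends Node[Plan]

      object TypedLookupDemo extends App {
        val plan = Plan("Project", Seq(Plan("Filter", Seq(Plan("Scan")))))
        val filter: Plan = plan.p(1)   // typed result, no asInstanceOf at the call site
        println(filter.name)           // prints "Filter"
      }
      ```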
      
      ## How was this patch tested?
      N/A - this is a developer only feature used for interactive debugging. As long as it compiles, it should be good to go. I tested this in spark-shell.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16288 from rxin/SPARK-18869.
      
      (cherry picked from commit 5d510c69)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18856][SQL] non-empty partitioned table should not report zero size · cb2c8428
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      In `DataSource`, if the table is not analyzed, we use 0 as the default value for the table size. This is dangerous: we may broadcast a large table and cause an OOM. We should use `defaultSizeInBytes` instead.
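
      A hedged sketch of the idea (illustrative names, not the exact `DataSource` code):

      ```scala
      // When no statistics exist for a relation, fall back to a large conservative
      // default instead of 0, so the planner never chooses to broadcast a table
      // whose real size is unknown.
      object SizeEstimateDemo extends App {
        def estimatedSizeInBytes(analyzedSize: Option[BigInt], defaultSizeInBytes: Long): BigInt =
          analyzedSize.getOrElse(BigInt(defaultSizeInBytes)) // returning 0 here would be unsafe

        // An un-analyzed table gets the huge default and is treated as "too big to
        // broadcast"; an analyzed table keeps its computed size.
        println(estimatedSizeInBytes(analyzedSize = None, defaultSizeInBytes = Long.MaxValue))
        println(estimatedSizeInBytes(analyzedSize = Some(BigInt(4096)), defaultSizeInBytes = Long.MaxValue))
      }
      ```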
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16280 from cloud-fan/bug.
      
      (cherry picked from commit d6f11a12)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates · 0d94201e
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
      
      While doing the QA work, I found the following issues:

      1) `spark.mlp` doesn't include an example;
      2) `spark.mlp` and `spark.lda` have redundant parameter explanations;
      3) the `spark.lda` documentation is missing default values for some parameters.
      
      I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.
      
      ## How was this patch tested?
      
      Manual test
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16284 from wangmiao1981/ks.
      
      (cherry picked from commit 32438853)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
    • [SPARK-18854][SQL] numberedTreeString and apply(i) inconsistent for subqueries · 280c35af
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      This is a bug introduced by subquery handling. numberedTreeString (which uses generateTreeString under the hood) numbers trees including innerChildren (used to print subqueries), but apply (which uses getNodeNumbered) ignores innerChildren. As a result, apply(i) would return the wrong plan node if there are subqueries.
      
      This patch fixes the bug.
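
      To illustrate the invariant this restores, here is a hedged, self-contained sketch with toy classes (not Spark's `TreeNode`; the traversal order is illustrative): the printed numbering and `apply(i)` walk the same sequence, including inner children such as subquery plans.

      ```scala
      // Both the printed numbering and apply(i) use one shared ordering, so a node
      // printed as "i" is always the node returned by apply(i), subqueries included.
      case class N(name: String, children: Seq[N] = Nil, innerChildren: Seq[N] = Nil) {
        // One shared depth-first ordering over inner children and children.
        def numberedNodes: Seq[N] = this +: (innerChildren ++ children).flatMap(_.numberedNodes)

        def numberedTreeString: String =
          numberedNodes.zipWithIndex.map { case (n, i) => s"$i ${n.name}" }.mkString("\n")

        def apply(i: Int): N = numberedNodes(i)
      }

      object NumberingDemo extends App {
        val plan = N("Filter", children = Seq(N("Scan")), innerChildren = Seq(N("Subquery", Seq(N("Aggregate")))))
        println(plan.numberedTreeString)  // 0 Filter, 1 Subquery, 2 Aggregate, 3 Scan
        println(plan(2).name)             // "Aggregate" -- consistent with the printed numbers
      }
      ```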
      
      ## How was this patch tested?
      Added a test case in SubquerySuite.scala that tests both the depth-first traversal of the numbering and the consistency of the two methods.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16277 from rxin/SPARK-18854.
      
      (cherry picked from commit ffdd1fcd)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes · d0d9c572
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Added a short section for KSTest.
      Also added the logreg model to the list of ML models in the vignette. (This will be reorganized under SPARK-18849.)
      
      ![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png)
      
      ## How was this patch tested?
      
      Manually tested example locally.
      Built vignettes locally.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #16283 from jkbradley/ksTest-vignette.
      
      (cherry picked from commit 78627425)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    • [SPARK-18852][SS] StreamingQuery.lastProgress should be null when recentProgress is empty · c4de90fc
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Right now `StreamingQuery.lastProgress` throws a NoSuchElementException, which is hard to handle in Python since the Python user will just see a Py4JError.
      
      This PR just makes it return null instead.
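
      A hedged sketch of the behavioral change (simplified illustration, not the actual progress-reporting code in Spark):

      ```scala
      // Return null when no progress has been reported yet, instead of throwing a
      // NoSuchElementException that surfaces as an opaque Py4JError in PySpark.
      final case class QueryProgress(batchId: Long)

      final class ProgressBuffer {
        @volatile private var progress: Vector[QueryProgress] = Vector.empty

        def recentProgress: Array[QueryProgress] = progress.toArray

        // Before: progress.last      -> throws on an empty buffer.
        // After:  lastOption.orNull  -> null, which maps cleanly to None/null in Python.
        def lastProgress: QueryProgress = progress.lastOption.orNull

        def add(p: QueryProgress): Unit = synchronized { progress = progress :+ p }
      }

      object LastProgressDemo extends App {
        val reporter = new ProgressBuffer
        println(reporter.lastProgress)          // null, not an exception
        reporter.add(QueryProgress(batchId = 0))
        println(reporter.lastProgress)          // QueryProgress(0)
      }
      ```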
      
      ## How was this patch tested?
      
      `test("lastProgress should be null when recentProgress is empty")`
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16273 from zsxwing/SPARK-18852.
      
      (cherry picked from commit 1ac6567b)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics · e8866f9f
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      This patch reduces the default element count estimation for arrays and maps from 100 to 1. The issue with 100 is that when types are nested (e.g. an array of maps), 100 * 100 would be used as the default size. This sounds like mere overestimation, which doesn't seem that bad (it is usually better to overestimate than underestimate). However, because of the way we estimate the output size of Project (new estimated column size / old estimated column size), this overestimation can become an underestimation. In this case it is generally safer to assume a default of 1 element.
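
      A hedged numeric illustration of that effect (the byte counts are made up; the formula mirrors the new/old column-size ratio described above):

      ```scala
      // Project's output size is estimated as childSizeInBytes * (projectedRowSize / childRowSize).
      // An inflated nested-column estimate in the denominator deflates the result.
      object ProjectEstimateDemo extends App {
        val childSizeInBytes = 10L * 1024 * 1024 * 1024   // assume the child scans ~10 GB
        val nestedColumnSize = 100L * 100 * 8             // array-of-map column with the old 100 * 100 default elements
        val otherColumnsSize = 64L
        val childRowSize     = nestedColumnSize + otherColumnsSize
        val projectedRowSize = otherColumnsSize           // the projection drops the nested column

        val estimatedOutput = childSizeInBytes * projectedRowSize / childRowSize
        println(s"estimated Project output: $estimatedOutput bytes")
        // ~8.6 MB, well under a typical broadcast threshold, even though the child is
        // really ~10 GB. With a default of 1 element per array/map, the ratio stays
        // close to 1 and the estimate stays conservative.
      }
      ```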
      
      ## How was this patch tested?
      This should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16274 from rxin/SPARK-18853.
      
      (cherry picked from commit 5d799473)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    • [SPARK-18753][SQL] Keep pushed-down null literal as a filter in Spark-side post-filter for FileFormat datasources · af12a21c
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      Currently, `FileSourceStrategy` does not handle the case where the pushed-down filter is `Literal(null)`, and it removes the filter from the Spark-side post-filter.
      
      For example, the code below:
      
      ```scala
      val df = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDF()
      df.filter($"_1" === "true").explain(true)
      ```
      
      shows it keeps `null` properly.
      
      ```
      == Parsed Logical Plan ==
      'Filter ('_1 = true)
      +- LocalRelation [_1#17]
      
      == Analyzed Logical Plan ==
      _1: boolean
      Filter (cast(_1#17 as double) = cast(true as double))
      +- LocalRelation [_1#17]
      
      == Optimized Logical Plan ==
      Filter (isnotnull(_1#17) && null)
      +- LocalRelation [_1#17]
      
      == Physical Plan ==
      *Filter (isnotnull(_1#17) && null)       << Here `null` is there
      +- LocalTableScan [_1#17]
      ```
      
      However, when we read it back from Parquet,
      
      ```scala
      val path = "/tmp/testfile"
      df.write.parquet(path)
      spark.read.parquet(path).filter($"_1" === "true").explain(true)
      ```
      
      `null` is removed at the post-filter.
      
      ```
      == Parsed Logical Plan ==
      'Filter ('_1 = true)
      +- Relation[_1#11] parquet
      
      == Analyzed Logical Plan ==
      _1: boolean
      Filter (cast(_1#11 as double) = cast(true as double))
      +- Relation[_1#11] parquet
      
      == Optimized Logical Plan ==
      Filter (isnotnull(_1#11) && null)
      +- Relation[_1#11] parquet
      
      == Physical Plan ==
      *Project [_1#11]
      +- *Filter isnotnull(_1#11)       << Here `null` is missing
         +- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
      ```
      
      This PR fixes it so that the filter is kept properly. In more detail,
      
      ```scala
      val partitionKeyFilters =
        ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
      ```
      
      This keeps the `null` in `partitionKeyFilters` because a `Literal` never has `children`, so its `references` set is empty, which is always a subset of `partitionSet`.
      
      And then in
      
      ```scala
      val afterScanFilters = filterSet -- partitionKeyFilters
      ```
      
      the `null` is always removed from the post-filter. So, if a filter's referenced fields are empty, it should be applied to the data columns as well.
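
      Here is a hedged, self-contained sketch of that splitting logic (plain Scala sets stand in for Catalyst's `ExpressionSet`/`AttributeSet`; the names are illustrative, not the actual `FileSourceStrategy` code):

      ```scala
      // Only filters that actually reference partition columns are subtracted from the
      // post-scan filters, so a reference-free predicate such as Literal(null) survives.
      object FilterSplitDemo extends App {
        final case class Expr(sql: String, references: Set[String])

        def split(filters: Seq[Expr], partitionCols: Set[String]): (Set[Expr], Set[Expr]) = {
          val partitionKeyFilters = filters.filter(_.references.subsetOf(partitionCols)).toSet
          // Before the fix: filters.toSet -- partitionKeyFilters, which drops Literal(null).
          // After the fix: subtract only filters that really touch a partition column.
          val afterScanFilters = filters.toSet -- partitionKeyFilters.filter(_.references.nonEmpty)
          (partitionKeyFilters, afterScanFilters)
        }

        val filters = Seq(Expr("isnotnull(_1)", Set("_1")), Expr("null", Set.empty))
        println(split(filters, partitionCols = Set("date")))
        // afterScanFilters now contains both isnotnull(_1) and the null literal.
      }
      ```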
      
      After this PR, it becomes as below:
      
      ```
      == Parsed Logical Plan ==
      'Filter ('_1 = true)
      +- Relation[_1#276] parquet
      
      == Analyzed Logical Plan ==
      _1: boolean
      Filter (cast(_1#276 as double) = cast(true as double))
      +- Relation[_1#276] parquet
      
      == Optimized Logical Plan ==
      Filter (isnotnull(_1#276) && null)
      +- Relation[_1#276] parquet
      
      == Physical Plan ==
      *Project [_1#276]
      +- *Filter (isnotnull(_1#276) && null)
         +- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b..., PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
      ```
      
      ## How was this patch tested?
      
      Unit test in `FileSourceStrategySuite`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16184 from HyukjinKwon/SPARK-18753.
      
      (cherry picked from commit 89ae26dc)
      Signed-off-by: Cheng Lian <lian@databricks.com>
    • [SPARK-18730] Post Jenkins test report page instead of the full console output page to GitHub · 16d4bd4a
      Cheng Lian authored
      
      ## What changes were proposed in this pull request?
      
      Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while.
      
      This PR makes the build script post the test report page link to GitHub instead. The test report page is far more concise and is usually the first page I'd check when investigating a Jenkins build failure.

      Note that for builds where a test report is not available (ongoing builds and builds that fail before test execution), the test report link automatically redirects to the build page.
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #16163 from liancheng/jenkins-test-report.
      
      (cherry picked from commit ba4aab9b)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18814][SQL] CheckAnalysis rejects TPCDS query 32 · f999312e
      Nattavut Sutyanyong authored
      
      ## What changes were proposed in this pull request?
      Move the check of the GROUP BY column in a correlated scalar subquery from CheckAnalysis to Analysis, to fix a regression caused by SPARK-18504.
      
      This problem can be reproduced with a simple script now.
      
      Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
      Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
      sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show
      
      The requirements are:
      1. We need to reference the same table twice in both the parent query and the subquery. Here, it is table c.
      2. We need to have a correlated predicate, but to a different table. Here, it goes from c (as c1) in the subquery to p in the parent.
      3. We then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the `Project` above the `Aggregate` of `avg`. When we compare `ck#<n1>#<n2>` with the original group-by column `ck#<n1>` by their canonicalized forms, `#<n2> != #<n1>`, which triggers the exception added in SPARK-18504.
      
      ## How was this patch tested?
      
      SubquerySuite and a simplified version of TPCDS-Q32
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16246 from nsyca/18814.
      
      (cherry picked from commit cccd6439)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
  2. Dec 12, 2016
    • [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots · 1aeb7f42
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`
      
      ## How was this patch tested?
      
      unit test, manually testing
      - snapshot build url
        - download when spark jar not cached
        - when spark jar is cached
      - RC build url
        - download when spark jar not cached
        - when spark jar is cached
      - multiple cached spark versions
      - starting with sparkR shell
      
      To use this,
      ```
      SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
      ```
      then in R,
      ```
      library(SparkR) # or specify lib.loc
      sparkR.session()
      ```
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16248 from felixcheung/rinstallurl.
      
      (cherry picked from commit 8a51cfdc)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-18681][SQL] Fix filtering to compatible with partition keys of type int · 523071f3
      Yuming Wang authored
      
      ## What changes were proposed in this pull request?
      
      Cloudera puts `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark isn't reading this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server.
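
      A hedged sketch of the approach (this assumes Hive's `IMetaStoreClient.getMetaConf` API and the hive-metastore dependency on the classpath; it is not the exact change made in Spark's Hive shim):

      ```scala
      import org.apache.hadoop.hive.metastore.IMetaStoreClient

      // Ask the Hive Metastore Server itself for the effective value of
      // hive.metastore.try.direct.sql instead of trusting the local hive-site.xml,
      // so Spark sees the same setting the server is actually running with.
      object MetastoreConf {
        def tryDirectSql(client: IMetaStoreClient): Boolean =
          Option(client.getMetaConf("hive.metastore.try.direct.sql"))
            .exists(_.toBoolean)
      }
      ```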
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16122 from wangyum/SPARK-18681.
      
      (cherry picked from commit 90abfd15)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    • [DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed · 35011608
      Bill Chambers authored
      ## What changes were proposed in this pull request?
      
      This PR clarifies where accumulators will be displayed.
      
      ## How was this patch tested?
      
      No testing.
      
      Author: Bill Chambers <bill@databricks.com>
      Author: anabranch <wac.chambers@gmail.com>
      Author: Bill Chambers <wchambers@ischool.berkeley.edu>
      
      Closes #16180 from anabranch/improve-acc-docs.
      
      (cherry picked from commit 70ffff21)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18790][SS] Keep a general offset history of stream batches · 63693c17
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and re-execute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100, and ensure that we keep enough log files in the following places to roll back the specified number of batches (a hedged sketch of this retention rule follows the list):
      - the offsets that are present in each batch
      - versions of the state store
      - the file lists stored for the FileStreamSource
      - the metadata log stored by the FileStreamSink
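
      A hedged sketch of that retention rule (simplified; the real logic lives in Spark's streaming metadata log classes):

      ```scala
      // Keep at least `minBatchesToRetain` batches of metadata; purge anything older.
      object RetentionDemo extends App {
        def batchesToPurge(currentBatchId: Long, minBatchesToRetain: Int, existing: Seq[Long]): Seq[Long] = {
          val oldestToKeep = currentBatchId - minBatchesToRetain + 1
          existing.filter(_ < oldestToKeep)
        }

        // With the proposed default of 100 retained batches, at batch 250 everything
        // before batch 151 can be purged from the offset log, state store versions,
        // file-source listings and file-sink metadata.
        println(batchesToPurge(currentBatchId = 250, minBatchesToRetain = 100, existing = 0L to 250L).size)
        // prints 151: batch ids 0 through 150 are eligible for purging
      }
      ```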
      
      marmbrus zsxwing
      
      ## How was this patch tested?
      
      The following tests were added.
      
      ### StreamExecution offset metadata
      Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesToRetain
      
      ### CompactibleFileStreamLog
      Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #16219 from tcondie/offset_hist.
      
      (cherry picked from commit 83a42897)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>