- Dec 14, 2016
-
-
Cheng Lian authored
## What changes were proposed in this pull request? Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while. This PR makes the build script post the test report page link to GitHub instead. The test report page is far more concise and is usually the first page I'd like to check when investigating a Jenkins build failure. Note that for builds where a test report is not available (ongoing builds and builds that fail before test execution), the test report link automatically redirects to the build page. ## How was this patch tested? N/A. Author: Cheng Lian <lian@databricks.com> Closes #16163 from liancheng/jenkins-test-report. (cherry picked from commit ba4aab9b) Signed-off-by:
Reynold Xin <rxin@databricks.com>
-
Nattavut Sutyanyong authored
## What changes were proposed in this pull request? Move the checking of GROUP BY column in correlated scalar subquery from CheckAnalysis to Analysis to fix a regression caused by SPARK-18504. This problem can be reproduced with a simple script now. Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p") Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c") sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show The requirements are: 1. We need to reference the same table twice in both the parent and the subquery. Here is the table c. 2. We need to have a correlated predicate but to a different table. Here is from c (as c1) in the subquery to p in the parent. 3. We will then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at `Project` above `Aggregate` of `avg`. Then when we compare `ck#<n1>#<n2>` and the original group by column `ck#<n1>` by their canonicalized form, which is #<n2> != #<n1>. That's how we trigger the exception added in SPARK-18504. ## How was this patch tested? SubquerySuite and a simplified version of TPCDS-Q32 Author: Nattavut Sutyanyong <nsy.can@gmail.com> Closes #16246 from nsyca/18814. (cherry picked from commit cccd6439) Signed-off-by:
Herman van Hovell <hvanhovell@databricks.com>
-
- Dec 13, 2016
-
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? While adding vignettes for kstest, I found some errors in the example: 1. There is a typo of kstest; 2. print.summary.KStest doesn't work with the example; Fix the example errors; Add a new unit test for print.summary.KStest; ## How was this patch tested? Manual test; Add new unit test; Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16259 from wangmiao1981/ks. (cherry picked from commit f2ddabfa) Signed-off-by:
Yanbo Liang <ybliang8@gmail.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Disable KafkaSourceStressForDontFailOnDataLossSuite for now. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16275 from zsxwing/ignore-flaky-test. (cherry picked from commit e104e55c) Signed-off-by:
Reynold Xin <rxin@databricks.com>
-
Xiangrui Meng authored
## What changes were proposed in this pull request? Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc. cc: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #16264 from mengxr/SPARK-18793. (cherry picked from commit 594b14f1) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
-
Tathagata Das authored
## What changes were proposed in this pull request? - Changed `StreamingQueryProgress.watermark` to `StreamingQueryProgress.queryTimestamps` which is a `Map[String, String]` containing the following keys: "eventTime.max", "eventTime.min", "eventTime.avg", "processingTime", "watermark". All of them UTC formatted strings. - Renamed `StreamingQuery.timestamp` to `StreamingQueryProgress.triggerTimestamp` to differentiate from `queryTimestamps`. It has the timestamp of when the trigger was started. ## How was this patch tested? Updated tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16258 from tdas/SPARK-18834. (cherry picked from commit c68fb426) Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request? This PR fixes the timeout value in `awaitResultInForkJoinSafely` for 2.1 and 2.0. Master has been fixed by https://github.com/apache/spark/pull/16230. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16268 from zsxwing/SPARK-18843.
-
Alex Bozarth authored
## What changes were proposed in this pull request? When I added a visibility check for the logs column on the executors page in #14382 the method I used only ran the check on the initial DataTable creation and not subsequent page loads. I moved the check out of the table definition and instead it runs on each page load. The jQuery DataTable functionality used is the same. ## How was this patch tested? Tested Manually No visible UI changes to screenshot. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #16256 from ajbozarth/spark18816. (cherry picked from commit aebf44e5) Signed-off-by:
Sean Owen <sowen@cloudera.com>
-
jerryshao authored
[SPARK-18840][YARN] Avoid throw exception when getting token renewal interval in non HDFS security environment ## What changes were proposed in this pull request? Fix `java.util.NoSuchElementException` when running Spark in non-hdfs security environment. In the current code, we assume `HDFS_DELEGATION_KIND` token will be found in Credentials. But in some cloud environments, HDFS is not required, so we should avoid this exception. ## How was this patch tested? Manually verified in local environment. Author: jerryshao <sshao@hortonworks.com> Closes #16265 from jerryshao/SPARK-18840. (cherry picked from commit 43298d15) Signed-off-by:
Marcelo Vanzin <vanzin@cloudera.com>
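The pattern behind this fix, treating a missing token as an absent value instead of assuming it is always present, can be sketched roughly as follows. This is an illustrative Python sketch and not the Spark code: the `Token` class, the `renewal_interval` function and the kind string are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical stand-in for the token kind; not taken from the patch.
HDFS_DELEGATION_KIND = "HDFS_DELEGATION_TOKEN"


@dataclass
class Token:
    kind: str
    issue_date_ms: int
    renewed_until_ms: int


def renewal_interval(tokens: Iterable[Token]) -> Optional[int]:
    """Return the renewal interval in ms, or None when no HDFS token exists."""
    # next(..., None) yields None instead of raising when nothing matches,
    # which is the moral equivalent of avoiding NoSuchElementException.
    token = next((t for t in tokens if t.kind == HDFS_DELEGATION_KIND), None)
    if token is None:
        return None
    return token.renewed_until_ms - token.issue_date_ms
```

Callers then have to handle the `None` case explicitly, which is the point: an environment without HDFS becomes a normal branch rather than an exception.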
-
Marcelo Vanzin authored
This avoids issues during maven tests because of shading. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #16260 from vanzin/SPARK-18835. (cherry picked from commit f280ccf4) Signed-off-by:
Marcelo Vanzin <vanzin@cloudera.com>
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? spark.logit is added in 2.1. We need to update spark-vignettes to reflect the changes. This is part of SparkR QA work. ## How was this patch tested? Manual build html. Please see attached image for the result.  Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16222 from wangmiao1981/veg. (cherry picked from commit 2aa16d03) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Major change in this PR: - Add `pendingQueryNames` and `pendingQueryIds` to track queries that are going to start but are not yet put into `activeQueries`, so that we don't need to hold a lock when starting a query. Minor changes: - Fix a potential NPE when the user sets `checkpointLocation` using SQLConf but doesn't specify a query name. - Add missing docs in `StreamingQueryListener` ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16220 from zsxwing/SPARK-18796. (cherry picked from commit 417e45c5) Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
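The idea of reserving a name in a "pending" set so that the slow start-up can run outside the lock can be sketched as below. This is a minimal Python illustration under assumed names (`QueryManager`, `start_query`, `start_fn` are hypothetical), not the actual `StreamingQueryManager` implementation.

```python
import threading


class QueryManager:
    """Toy manager that only holds its lock for cheap bookkeeping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = {}      # name -> running query
        self._pending = set()  # names reserved but not yet active

    def start_query(self, name, start_fn):
        # Reserve the name under the lock; this part is cheap.
        with self._lock:
            if name in self._active or name in self._pending:
                raise ValueError(f"query name {name!r} is already in use")
            self._pending.add(name)
        try:
            # The potentially slow start-up runs without holding the lock.
            query = start_fn()
            with self._lock:
                self._active[name] = query
            return query
        finally:
            # Whether start-up succeeded or failed, the reservation ends.
            with self._lock:
                self._pending.discard(name)

    def active_names(self):
        with self._lock:
            return sorted(self._active)
```

The pending set is what keeps name uniqueness intact: a second caller using the same name is rejected even though the first query has not yet reached the active map.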
-
- Dec 12, 2016
-
-
Felix Cheung authored
## What changes were proposed in this pull request? Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL` ## How was this patch tested? unit test, manually testing - snapshot build url - download when spark jar not cached - when spark jar is cached - RC build url - download when spark jar not cached - when spark jar is cached - multiple cached spark versions - starting with sparkR shell To use this, ``` SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R ``` then in R, ``` library(SparkR) # or specify lib.loc sparkR.session() ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16248 from felixcheung/rinstallurl. (cherry picked from commit 8a51cfdc) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Yuming Wang authored
## What changes were proposed in this pull request? Cloudera puts `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark isn't reading this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server. ## How was this patch tested? The existing tests. Author: Yuming Wang <wgyumg@gmail.com> Closes #16122 from wangyum/SPARK-18681. (cherry picked from commit 90abfd15) Signed-off-by:
Herman van Hovell <hvanhovell@databricks.com>
-
Bill Chambers authored
## What changes were proposed in this pull request? This PR clarifies where accumulators will be displayed. ## How was this patch tested? No testing. Author: Bill Chambers <bill@databricks.com> Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #16180 from anabranch/improve-acc-docs. (cherry picked from commit 70ffff21) Signed-off-by:
Sean Owen <sowen@cloudera.com>
-
Tyson Condie authored
## What changes were proposed in this pull request? Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and reexecute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches that defaults to 100 and ensure that we keep enough log files in the following places to roll back the specified number of batches: the offsets that are present in each batch; versions of the state store; the file lists stored for the FileStreamSource; the metadata log stored by the FileStreamSink. marmbrus zsxwing ## How was this patch tested? The following tests were added. ### StreamExecution offset metadata Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesRetain ### CompactibleFileStreamLog Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain. Author: Tyson Condie <tcondie@gmail.com> Closes #16219 from tcondie/offset_hist. (cherry picked from commit 83a42897) Signed-off-by:
Shixiong Zhu <shixiong@databricks.com>
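The retention rule described in this commit, keeping enough log entries to roll back a fixed number of batches, can be illustrated with a small sketch. The function name `purge_old_batches` and the exact boundary (keep the most recent `min_batches_to_retain` batch ids, counting the current one) are assumptions made for the example; the real logs also have to account for compaction files, which this sketch ignores.

```python
from typing import Dict


def purge_old_batches(
    log: Dict[int, str], current_batch_id: int, min_batches_to_retain: int
) -> Dict[int, str]:
    """Return a copy of the log without entries outside the retention window."""
    threshold = current_batch_id - min_batches_to_retain
    # Keep batch ids strictly greater than the threshold, so exactly
    # min_batches_to_retain entries survive once enough batches exist.
    return {bid: value for bid, value in log.items() if bid > threshold}
```

With a current batch id of 10 and a retention of 3, the entries for batches 8, 9 and 10 survive, so the stream can still be re-executed from any of them.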
-
- Dec 11, 2016
-
-
krishnakalyan3 authored
## What changes were proposed in this pull request? Updated Scala param and Python param to have quotes around the options making it easier for users to read. ## How was this patch tested? Manually checked the docstrings Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #16242 from krishnakalyan3/doc-string. (cherry picked from commit c802ad87) Signed-off-by:
Sean Owen <sowen@cloudera.com>
-
Wenchen Fan authored
## What changes were proposed in this pull request? After https://github.com/apache/spark/pull/15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in https://github.com/apache/spark/pull/15620#discussion_r91829129 , it seems that the regression test is overkill and may hit the constant pool size limitation, which is a known issue and hasn't been fixed yet. Since #15620 only fixes the code size limitation problem, we can simplify the test to avoid hitting the constant pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <wenchen@databricks.com> Closes #16244 from cloud-fan/minor. (cherry picked from commit 9abd05b6) Signed-off-by:
Sean Owen <sowen@cloudera.com>
-
- Dec 10, 2016
-
-
wangzhenhua authored
[SPARK-18815][SQL] Fix NPE when collecting column stats for string/binary column having only null values ## What changes were proposed in this pull request? During column stats collection, average and max length will be null if a column of string/binary type has only null values. To fix this, I use default size when avg/max length is null. ## How was this patch tested? Add a test for handling null columns Author: wangzhenhua <wangzhenhua@huawei.com> Closes #16243 from wzhfy/nullStats. (cherry picked from commit a29ee55a) Signed-off-by:
Reynold Xin <rxin@databricks.com>
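The fallback described here, using a default size when the average or maximum length cannot be computed because every value is null, might look roughly like this. The function name and the `DEFAULT_SIZE` value are hypothetical; the actual default used by Spark is not stated in the commit message.

```python
from typing import Optional, Sequence, Tuple

DEFAULT_SIZE = 20  # hypothetical default; not taken from the patch


def string_column_lengths(
    values: Sequence[Optional[str]], default_size: int = DEFAULT_SIZE
) -> Tuple[float, int]:
    """Return (average length, max length) in bytes for a string column."""
    lengths = [len(v.encode("utf-8")) for v in values if v is not None]
    if not lengths:
        # All values are null: there is nothing to aggregate, so fall back
        # to the default instead of propagating a null statistic.
        return float(default_size), default_size
    return sum(lengths) / len(lengths), max(lengths)
```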
-
Michal Senkyr authored
## What changes were proposed in this pull request? The API documentation build was failing when using Java 8 due to incorrect character `>` in Javadoc. Replace `>` with literals in Javadoc to allow the build to pass. ## How was this patch tested? Documentation was built and inspected manually to ensure it still displays correctly in the browser ``` cd docs && jekyll serve ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Closes #16201 from michalsenkyr/javadoc8-gt-fix. (cherry picked from commit 11432483) Signed-off-by:
Sean Owen <sowen@cloudera.com>
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? According to the notice of the following Wiki front page, we can remove the obsolete wiki pointer safely in `README.md` and `docs/index.md`, too. These two lines are the last occurrences of those links. ``` All current wiki content has been merged into pages at http://spark.apache.org as of November 2016. Each page links to the new location of its information on the Spark web site. Obsolete wiki content is still hosted here, but carries a notice that it is no longer current. ``` ## How was this patch tested? Manual. - `README.md`: https://github.com/dongjoon-hyun/spark/tree/remove_wiki_from_readme - `docs/index.md`: ``` cd docs SKIP_API=1 jekyll build ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16239 from dongjoon-hyun/remove_wiki_from_readme. (cherry picked from commit f3a3fed7) Signed-off-by:
Sean Owen <sowen@cloudera.com>
-
Huaxin Gao authored
## What changes were proposed in this pull request? 1. In SparkStrategies.canBroadcast, I will add the check plan.statistics.sizeInBytes >= 0 2. In LocalRelations.statistics, when calculating the statistics, I will change the size to BigInt so it won't overflow. ## How was this patch tested? I will add a test case to make sure the statistics.sizeInBytes won't overflow. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #16175 from huaxingao/spark-17460. (cherry picked from commit c5172568) Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
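Why both changes matter can be seen in a small sketch: a size computed in a fixed-width 64-bit integer can wrap around to a negative number, and a negative size trivially passes a "smaller than the threshold" test. Python integers do not overflow, so `wrap_int64` below emulates the 64-bit behaviour; all function names and the threshold value are illustrative assumptions, not the Spark code.

```python
def wrap_int64(value: int) -> int:
    """Emulate signed 64-bit wraparound, as a JVM Long multiplication would."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def size_in_bytes_long(row_size: int, num_rows: int) -> int:
    # Fixed-width arithmetic: the product may wrap to a negative number.
    return wrap_int64(row_size * num_rows)


def size_in_bytes_bigint(row_size: int, num_rows: int) -> int:
    # Arbitrary-precision arithmetic (Python int here, BigInt on the JVM).
    return row_size * num_rows


def can_broadcast(size_in_bytes: int, threshold: int) -> bool:
    # The lower bound rejects sizes that went negative through overflow.
    return 0 <= size_in_bytes <= threshold
```

Without the lower bound in `can_broadcast`, the wrapped value would compare as smaller than any positive threshold and a huge relation would be considered broadcastable.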
-
Burak Yavuz authored
## What changes were proposed in this pull request? When you start a stream, if we are trying to resolve the source of the stream, for example if we need to resolve partition columns, this could take a long time. This long execution time should not block the main thread where `query.start()` was called on. It should happen in the stream execution thread possibly before starting any triggers. ## How was this patch tested? Unit test added. Made sure test fails with no code changes. Author: Burak Yavuz <brkyvz@gmail.com> Closes #16238 from brkyvz/SPARK-18811. (cherry picked from commit 63c91598) Signed-off-by:
Shixiong Zhu <shixiong@databricks.com>
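The behaviour this commit aims for, where `start()` returns right away and the slow source resolution runs on the stream's own thread, can be sketched as follows. The class and attribute names are hypothetical and the sketch leaves out triggers entirely; it only shows where the slow call is made.

```python
import threading


class StreamingQuery:
    """Toy query whose source is resolved lazily on its own thread."""

    def __init__(self, resolve_source):
        self._resolve_source = resolve_source  # potentially slow callable
        self.source = None
        self.resolved = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        # Only spawns the thread; the caller is not blocked by resolution.
        self._thread.start()
        return self

    def _run(self):
        # Resolution happens here, before any trigger would be executed.
        self.source = self._resolve_source()
        self.resolved.set()
```

A caller that needs the resolved source can wait on `resolved`, while callers that only want a handle to the query are never held up.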
-
- Dec 09, 2016
-
-
Felix Cheung authored
## What changes were proposed in this pull request? Several SparkR APIs that call into JVM methods with void return values have their results printed, especially when running in a REPL or IDE. example: ``` > setLogLevel("WARN") NULL ``` We should fix this to make the result clearer. Also found a small change to the return value of dropTempView in 2.1 - adding doc and test for it. ## How was this patch tested? manually - I didn't find an expect_*() method in testthat for this Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16237 from felixcheung/rinvis. (cherry picked from commit 3e11d5bf) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Xiangrui Meng authored
## What changes were proposed in this pull request? There has been some confusion around "Spark ML" vs. "MLlib". This PR adds some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce the confusion. I check the [Spark FAQ page](http://spark.apache.org/faq.html ), which seems too high-level for the content here. So I added it to the MLlib user guide instead. cc: mateiz Author: Xiangrui Meng <meng@databricks.com> Closes #16241 from mengxr/SPARK-18812. (cherry picked from commit d2493a20) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
-
Kazuaki Ishizaki authored
## What changes were proposed in this pull request? This PR prevents the result of a `toInt` cast from being negative due to signed integer overflow (e.g. 0x0000_0000_1???????L.toInt < 0 ). This PR performs the cast only after we can ensure the value is within the range of a signed integer (the result of `max(array.length, ???)` is always integer). ## How was this patch tested? Manually executed query68 of TPC-DS with 100TB Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #16235 from kiszk/SPARK-18745. (cherry picked from commit d60ab5fd) Signed-off-by:
Herman van Hovell <hvanhovell@databricks.com>
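The general hazard, narrowing a 64-bit value to 32 bits before checking its range, can be reproduced in a short sketch. Python integers are unbounded, so `to_int32` emulates the JVM narrowing cast; `clamp_then_cast` is a hypothetical helper showing the "check the range first, cast afterwards" ordering and is not the code from the patch.

```python
INT_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Emulate a JVM long-to-int cast, which keeps only the low 32 bits."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def clamp_then_cast(value: int) -> int:
    """Bring a non-negative value into int range first, then cast."""
    # After the clamp the cast cannot change the value any more.
    return to_int32(min(value, INT_MAX))
```

A value such as `0x90000000` fits comfortably in a 64-bit long but has its sign bit set once truncated to 32 bits, so the naive cast yields a negative number while the clamped version stays at `INT_MAX`.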
-
Shivaram Venkataraman authored
Fix SparkR package copy regex. The existing code leads to ``` Copying release tarballs to /home/****/public_html/spark-nightly/spark-branch-2.1-bin/spark-2.1.1-SNAPSHOT-2016_12_08_22_38-e8f351f9-bin mput: SparkR-*: no files found ``` Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16231 from shivaram/typo-sparkr-build. (cherry picked from commit be5fc6ef) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Xiangrui Meng authored
## What changes were proposed in this pull request? * This PR changes `JVMObjectTracker` from `object` to `class` and let its instance associated with each RBackend. So we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions. `RBackend.close` will clear the object tracker explicitly. * I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong. * Small refactor of `SerDe.sqlSerDe` to increase readability. ## How was this patch tested? * Added unit tests for `JVMObjectTracker`. * Wait for Jenkins to run full tests. Author: Xiangrui Meng <meng@databricks.com> Closes #16154 from mengxr/SPARK-17822. (cherry picked from commit fd48d80a) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
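The move from a single global tracker to one tracker per backend can be illustrated with a small sketch. `ObjectTracker` and `Backend` are hypothetical Python stand-ins for `JVMObjectTracker` and `RBackend`; the id scheme is an assumption made for the example.

```python
import itertools


class ObjectTracker:
    """Per-instance registry mapping string ids to tracked objects."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count()

    def add(self, obj) -> str:
        obj_id = str(next(self._ids))
        self._objects[obj_id] = obj
        return obj_id

    def get(self, obj_id: str):
        return self._objects.get(obj_id)

    def clear(self) -> None:
        self._objects.clear()

    def __len__(self) -> int:
        return len(self._objects)


class Backend:
    """Each backend owns its tracker, so lifecycles do not interfere."""

    def __init__(self):
        self.tracker = ObjectTracker()

    def close(self) -> None:
        # Closing one backend releases only the objects it tracked.
        self.tracker.clear()
```

With a module-level singleton, closing one session would either leak its objects or wipe those of every other session; tying the tracker to the backend instance avoids both.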
-
Jacek Laskowski authored
## What changes were proposed in this pull request? Typo fixes ## How was this patch tested? Local build. Awaiting the official build. Author: Jacek Laskowski <jacek@japila.pl> Closes #16144 from jaceklaskowski/typo-fixes. (cherry picked from commit b162cc0c) Signed-off-by:
Sean Owen <sowen@cloudera.com>
-
Zhan Zhang authored
Make stateful udf as nondeterministic Add new test cases with both Stateful and Stateless UDF. Without the patch, the test cases will throw exception: 1 did not equal 10 ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501) org.scalatest.exceptions.TestFailedException: 1 did not equal 10 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) ... Author: Zhan Zhang <zhanzhang@fb.com> Closes #16068 from zhzhan/state. (cherry picked from commit 67587d96) Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
-
Felix Cheung authored
## What changes were proposed in this pull request? Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822 ) Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16227 from felixcheung/pyrftp. (cherry picked from commit c074c96d) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Shivaram Venkataraman authored
This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16226 from shivaram/fix-sparkr-copy-build. (cherry picked from commit 934035ae) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? In this PR, the document of `summary` method is improved in the format: returns summary information of the fitted model, which is a list. The list includes ....... Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here. In current document, some `return` have `.` and some don't have. `.` is added to missed ones. Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged. ## How was this patch tested? Manual build. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16150 from wangmiao1981/audit2. (cherry picked from commit 86a96034) Signed-off-by:
Felix Cheung <felixcheung@apache.org>
-
- Dec 08, 2016
-
-
Shivaram Venkataraman authored
[SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution ## What changes were proposed in this pull request? Fixes name of R source package so that the `cp` in release-build.sh works correctly. Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16221 from shivaram/fix-sparkr-release-build-name. (cherry picked from commit 4ac8b20b) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Backport #16203 to branch 2.1. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16216 from zsxwing/SPARK-18774-2.1.
-
Tathagata Das authored
## What changes were proposed in this pull request? - Changed FileStreamSource to use new FileStreamSourceOffset rather than LongOffset. The field is named `logOffset` to make it clearer that this is an offset in the file stream log. - Fixed bug in FileStreamSourceLog, the field endId in the FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier. Only my updated tests caught it. Other minor changes - Don't use batchId in the FileStreamSource, as calling it batch id is extremely misleading. With multiple sources, it may happen that a new batch has no new data from a file source. So offset of FileStreamSource != batchId after that batch. ## How was this patch tested? Updated unit test. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16205 from tdas/SPARK-18776. (cherry picked from commit 458fa332) Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
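The `endId` bug mentioned above is the kind of error that a range lookup honouring both bounds makes easy to test. The sketch below assumes optional, inclusive bounds and a plain dictionary as the log; `get_range` is a hypothetical name and the real `FileStreamSourceLog.get` semantics may differ.

```python
from typing import Dict, List, Optional, Tuple


def get_range(
    log: Dict[int, str], start_id: Optional[int], end_id: Optional[int]
) -> List[Tuple[int, str]]:
    """Return (id, entry) pairs with start_id <= id <= end_id, in id order."""
    return [
        (batch_id, log[batch_id])
        for batch_id in sorted(log)
        if (start_id is None or batch_id >= start_id)
        and (end_id is None or batch_id <= end_id)  # the bound that was ignored
    ]
```

A test that only ever asks for "everything from `start_id` onwards" cannot tell whether the upper bound is applied, which matches the remark that no earlier test caught the problem.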
-
Shivaram Venkataraman authored
This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16218 from shivaram/fix-sparkr-release-build. (cherry picked from commit 202fcd21) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest): Before: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string> ``` After: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string> ``` Also closes #14680. ## How was this patch tested? Verified in spark-shell. Author: Reynold Xin <rxin@databricks.com> Closes #16187 from rxin/SPARK-18760. (cherry picked from commit 5f894d23) Signed-off-by:
Reynold Xin <rxin@databricks.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request? When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit. - ContextCleaner.keepCleaning - LiveListenerBus.listenerThread.run - TaskSchedulerImpl.start This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16178 from zsxwing/fix-stop-deadlock. (cherry picked from commit 26432df9) Signed-off-by:
Shixiong Zhu <shixiong@databricks.com>
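The deadlock and its workaround can be sketched with plain threads. In this illustrative Python version, `stop()` waits for a worker thread to exit, so the worker must not call `stop()` itself; it hands the call to a fresh thread instead. `Service`, `stop_in_new_thread` and the simulated failure are hypothetical, and unlike the JVM case Python raises `RuntimeError` rather than hanging when a thread tries to join itself.

```python
import threading


class Service:
    """Toy service whose stop() waits for its own worker thread."""

    def __init__(self):
        self.stopped = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._worker.start()

    def _run(self):
        try:
            raise ValueError("simulated fatal error in the worker")
        except ValueError:
            # Calling self.stop() here would make the worker wait for
            # itself; delegate the shutdown to another thread instead.
            self.stop_in_new_thread()

    def stop(self):
        self.stopped.set()
        # Must not run on the worker thread: a thread cannot join itself.
        self._worker.join()

    def stop_in_new_thread(self):
        threading.Thread(target=self.stop, daemon=True).start()
```

The worker returns right after scheduling the shutdown, so the helper thread's `join()` completes and the service stops cleanly.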
-
Felix Cheung authored
This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not). But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes. These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md ) on what's going to a CRAN release, which is now run during make-distribution.sh. 1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path 2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation) 3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN (will skip tests here but tests will need to pass for CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests) 4. `R CMD INSTALL` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1) (the output of this step is what we package into Spark dist and sparkr.zip) Alternatively, R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times this is relatively fast. Building vignettes takes a while though. Manually, CI.
Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16014 from felixcheung/rdist. (cherry picked from commit c3d3a9d0) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-