- Jul 13, 2016
-
-
oraviv authored
## What changes were proposed in this pull request? We have a use case of multiplying very big sparse matrices: about 1000x1000 distributed block matrix multiplications. The simulated multiply scales as O(n^4) (n being 1000) and takes about 1.5 hours. We modified it slightly with a classical hash map, and it now runs in about 30 seconds at O(n^2). ## How was this patch tested? We have added a performance test and verified the reduced time. Author: oraviv <oraviv@paypal.com> Closes #14068 from uzadude/master.
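A hedged sketch of the idea (not the actual `BlockMatrix.simulateMultiply` code): group B's block coordinates by row index once, so each block of A finds its partners with a hash lookup instead of a scan over all pairs.
```scala
// Sketch only: pair up non-empty blocks of A and B for multiplication.
// Grouping B's blocks by row index turns the pairwise O(n^4) index scan
// into an O(n^2) build-and-lookup.
object BlockPairingSketch {
  def pairings(
      aBlocks: Seq[(Int, Int)],  // (rowIndex, colIndex) of A's non-empty blocks
      bBlocks: Seq[(Int, Int)]   // (rowIndex, colIndex) of B's non-empty blocks
  ): Seq[((Int, Int), (Int, Int))] = {
    val bByRow = bBlocks.groupBy(_._1)  // join key: A's colIndex == B's rowIndex
    for {
      a <- aBlocks
      b <- bByRow.getOrElse(a._2, Seq.empty)
    } yield (a, b)
  }
}
```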
-
Sean Owen authored
## What changes were proposed in this pull request? Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #14153 from srowen/SPARK-16440.
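A minimal sketch of the general pattern (illustrative names, not the Word2Vec internals): unpersist a broadcast variable as soon as it is no longer needed so executor-side copies are released promptly.
```scala
import org.apache.spark.SparkContext

// Illustrative only: release a broadcast variable explicitly instead of
// waiting for it to be garbage collected.
def countWithLookup(sc: SparkContext): Long = {
  val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
  val n = sc.parallelize(Seq(1, 2, 3)).filter(x => lookup.value.contains(x)).count()
  lookup.unpersist()  // timely cleanup, as the Word2Vec.fit fix does for its broadcasts
  n
}
```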
-
sharkd authored
## What changes were proposed in this pull request? Fix a code error in the yarn-cluster unit test. ## How was this patch tested? Used existing tests. Author: sharkd <sharkd.tu@gmail.com> Closes #14166 from sharkdtu/master.
-
sandy authored
## What changes were proposed in this pull request? Add documentation for asynchronous actions to the actions section of the programming guide. ## How was this patch tested? Checked the documentation indentation and formatting with a Markdown preview. Author: sandy <phalodi@gmail.com> Closes #14104 from phalodi/SPARK-16438.
-
Maciej Brynski authored
## What changes were proposed in this pull request? The Spark SQL UI displays numbers greater than 1000 with U+00A0 (a non-breaking space) as the grouping separator. The problem exists when the server locale uses a non-breaking space as the separator (for example pl_PL). This patch turns off grouping and removes this separator. The problem started with this PR: https://github.com/apache/spark/pull/12425/files#diff-803f475b01acfae1c5c96807c2ea9ddcR125 ## How was this patch tested? Manual UI tests; screenshot attached. Author: Maciej Brynski <maciej.brynski@adpilot.pl> Closes #14142 from maver1ck/master.
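A small sketch of the locale behavior behind the fix (assuming a Polish locale; the values in comments are what `java.text` produces):
```scala
import java.text.NumberFormat
import java.util.Locale

// With a locale such as pl_PL the integer grouping separator is a
// non-breaking space (U+00A0); turning grouping off avoids it entirely.
val nf = NumberFormat.getIntegerInstance(Locale.forLanguageTag("pl-PL"))
val before = nf.format(100000L)  // "100\u00A0000" - what the UI used to render
nf.setGroupingUsed(false)
val after = nf.format(100000L)   // "100000"
```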
-
Xin Ren authored
## What changes were proposed in this pull request? Fix Java style errors and remove unused imports that were found incidentally. ## How was this patch tested? Tested on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #14161 from keypointt/SPARK-16437.
-
Alex Bozarth authored
## What changes were proposed in this pull request? Fixed a misassigned variable: numCompletedTasks was assigned to numSkippedTasks in the convertJobData method. ## How was this patch tested? dev/run-tests Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14141 from ajbozarth/spark16375.
-
Sean Owen authored
## What changes were proposed in this pull request? A second form of AssertQuery now actually invokes the condition; avoids a build warning too ## How was this patch tested? Jenkins; running StreamTest Author: Sean Owen <sowen@cloudera.com> Closes #14133 from srowen/SPARK-15889.2.
-
aokolnychyi authored
- Hard-coded Spark SQL sample snippets were moved into source files under the examples sub-project. - Removed the inconsistency between the Scala and Java Spark SQL examples. - The Scala and Java Spark SQL examples were updated. The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review. Author: aokolnychyi <okolnychyyanton@gmail.com> Closes #14119 from aokolnychyi/spark_16303.
-
Eric Liang authored
## What changes were proposed in this pull request? RegexExtract and RegexReplace currently crash on non-nullable input due to the use of a hard-coded local variable name (e.g. compiles fail with `java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 26: Redefinition of local variable "m" `). This changes those variables to use fresh names, and also does so in a few other places. ## How was this patch tested? Unit tests. rxin Author: Eric Liang <ekl@databricks.com> Closes #14168 from ericl/sc-3906.
-
- Jul 12, 2016
-
-
petermaxlee authored
## What changes were proposed in this pull request? This patch implements the reflect SQL function, which can be used to invoke a Java method in SQL. Slightly differently from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller set of data types, and requires the function to be static, as suggested by rxin in #13969. java_method is an alias for reflect, so this should also resolve SPARK-16277. ## How was this patch tested? Added expression unit tests and an end-to-end test. Author: petermaxlee <petermaxlee@gmail.com> Closes #14138 from petermaxlee/reflect-static.
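A hedged usage sketch (assuming a SparkSession named `spark` with this patch applied); the class and method names must be string literals and the target method static:
```scala
// reflect invokes a static Java method from SQL; java_method is an alias.
spark.sql("SELECT reflect('java.util.UUID', 'randomUUID')").show()
spark.sql("SELECT java_method('java.lang.Double', 'parseDouble', '1.5')").show()
```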
-
Marcelo Vanzin authored
This option is used by Hive to directly delete the files instead of moving them to the trash. This is needed in certain configurations where moving the files does not work. For non-Hive tables and partitions, Spark already behaves as if the PURGE option was set, so there's no need to do anything. Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for partitions), so the code reflects that: trying to use the option with older versions of Hive will cause an exception to be thrown. The change is a little noisier than I would like, because of the code to propagate the new flag through all the interfaces and implementations; the main changes are in the parser and in HiveShim, aside from the tests (DDLCommandSuite, VersionsSuite). Tested by running sql and catalyst unit tests, plus VersionsSuite which has been updated to test the version-specific behavior. I also ran an internal test suite that uses PURGE and would not pass previously. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #13831 from vanzin/SPARK-16119.
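A hedged sketch of the new syntax (table and partition names are illustrative; requires a Hive-enabled session, and Hive 0.14+/1.2+ as noted above):
```scala
// PURGE deletes the underlying files directly instead of moving them to the trash.
spark.sql("DROP TABLE IF EXISTS logs PURGE")
spark.sql("ALTER TABLE events DROP IF EXISTS PARTITION (dt = '2016-07-12') PURGE")
```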
-
Yangyang Liu authored
## What changes were proposed in this pull request? Since the external shuffle service is essential for Spark, better monitoring for the shuffle service is necessary. To do so, we added various metrics to the shuffle service and exposed them through ExternalShuffleServiceSource for the metrics system. Metrics added in the shuffle service: * registeredExecutorsSize * openBlockRequestLatencyMillis * registerExecutorRequestLatencyMillis * blockTransferRateBytes JIRA Issue: https://issues.apache.org/jira/browse/SPARK-16405 ## How was this patch tested? Some test cases were added to verify that the metrics appear in the metrics system as expected. Those unit test cases are in `ExternalShuffleBlockHandlerSuite`. Author: Yangyang Liu <yangyangliu@fb.com> Closes #14080 from lovexi/yangyang-metrics.
-
sharkd authored
[SPARK-16414][YARN] Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf on yarn cluster mode" ## What changes were proposed in this pull request? The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster` in `ApplicationMaster.main` when deploying Spark in yarn cluster mode, so the `conf` in the `SparkHadoopUtil` singleton didn't include the user's configuration. Therefore, we should load the properties file with the Spark configuration and set its entries as system properties before `SparkHadoopUtil` is first instantiated. ## How was this patch tested? Added a test case. Author: sharkd <sharkd.tu@gmail.com> Author: sharkdtu <sharkdtu@tencent.com> Closes #14088 from sharkdtu/master.
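An illustrative snippet of the call path the bug affected (the configuration key is a hypothetical placeholder):
```scala
import org.apache.spark.deploy.SparkHadoopUtil

// In yarn-cluster mode, user settings passed as spark.hadoop.* should be
// visible here; before the fix they were missing because the singleton was
// created before the Spark properties file was loaded.
val hadoopConf = SparkHadoopUtil.get.conf
val userValue = hadoopConf.get("my.user.setting")  // hypothetical key
```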
-
Reynold Xin authored
## What changes were proposed in this pull request? In code generation, it is incorrect for an expression to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which the pmod expression reuses the variable name "r". This patch updates the ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable-reuse problems in expression unit tests. This patch also fixes the bug in the crc32 expression. ## How was this patch tested? This is a test harness change, but I also created a new test suite for testing the test harness. Author: Reynold Xin <rxin@databricks.com> Closes #14146 from rxin/SPARK-16489.
-
Lianhui Wang authored
[SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators. ## What changes were proposed in this pull request? When a query uses only metadata (for example, partition keys), it can return results based on metadata without scanning files. Hive did this in HIVE-1003. ## How was this patch tested? Added unit tests. Author: Lianhui Wang <lianhuiwang09@gmail.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Lianhui Wang <lianhuiwang@users.noreply.github.com> Closes #13494 from lianhuiwang/metadata-only.
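A hedged example of the kind of query that can now be answered from partition metadata alone, assuming a table `events` partitioned by `dt`:
```scala
// Both queries only touch the partition column, so no data files need to be scanned.
spark.sql("SELECT DISTINCT dt FROM events").show()
spark.sql("SELECT MAX(dt) FROM events").show()
```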
-
WeichenXu authored
[SPARK-16470][ML][OPTIMIZER] Check whether linear regression training actually reaches convergence and add a warning if not ## What changes were proposed in this pull request? `ml.regression.LinearRegression` uses the Breeze `LBFGS` and `OWLQN` optimizers for training, but does not check whether the result returned by Breeze's optimizer actually reached convergence. The `LBFGS` and `OWLQN` optimizers in Breeze may finish iterating for the following reasons: 1) the max iteration number was reached, 2) the function value converged, 3) the objective function stopped improving, 4) the gradient converged, 5) the search failed (due to some internal numerical error). I added warning-printing code so that if the iteration ends in case (1), (3), or (5) above, a warning with the respective reason string is printed. ## How was this patch tested? Manual. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14122 from WeichenXu123/add_lr_not_convergence_warn.
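A minimal sketch of a setup likely to trigger the new warning (the data and parameter values are illustrative, not from the patch):
```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lr-warn-demo").getOrCreate()

// A tiny synthetic dataset; with maxIter = 1 the optimizer usually stops on
// "max iterations reached", one of the cases that now logs a warning.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(4.0, 0.5))
)).toDF("label", "features")

val lr = new LinearRegression().setMaxIter(1).setRegParam(0.1)
val model = lr.fit(training)
```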
-
Takuya UESHIN authored
[SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize. ## What changes were proposed in this pull request? Currently the input `RDD` of a `Dataset` is always serialized to `RDD[InternalRow]` prior to being used as a `Dataset`, but there is a case where we use `map` or `mapPartitions` just after converting to a `Dataset`. In this case, a serialize-then-deserialize round-trip happens even though it is not needed. This PR adds an `ExternalRDD` logical plan for input with `RDD` to have a chance to eliminate serialize/deserialize. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13890 from ueshin/issues/SPARK-16189.
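A hedged sketch of the pattern that benefits: converting an RDD to a Dataset and immediately transforming it with typed operations (names are illustrative).
```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

// Illustrative only: with this change the map below has a chance to avoid a
// needless serialize/deserialize round-trip through InternalRow.
val spark = SparkSession.builder().master("local[*]").appName("external-rdd-demo").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("a", 1), Person("b", 2)))
val ds = spark.createDataset(rdd).map(p => p.copy(age = p.age + 1))
```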
-
WeichenXu authored
## What changes were proposed in this pull request? In the `train` method of `ml.regression.LinearRegression`, when handling the situation `std(label) == 0`, the code replaces `std(label)` with `mean(label)`, but the related comment is inconsistent; I updated it. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14121 from WeichenXu123/update_lr_comment.
-
petermaxlee authored
## What changes were proposed in this pull request? It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement the unhandledFilters method in HadoopFsRelation. This is based on rxin's patch (#13901) and adds unit tests. ## How was this patch tested? Added a new suite, FiltersSuite. Author: petermaxlee <petermaxlee@gmail.com> Author: Reynold Xin <rxin@databricks.com> Closes #14120 from petermaxlee/SPARK-16199.
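A hedged sketch of what the new capability looks like for data source filters (treat the exact API shape as an assumption based on this patch's description):
```scala
import org.apache.spark.sql.sources.{And, EqualTo, GreaterThan}

// A composed filter should report every column it touches, which lets a
// relation decide which filters it cannot handle itself.
val filter = And(EqualTo("a", 1), GreaterThan("b", 10))
val referenced: Array[String] = filter.references  // expected: Array("a", "b")
```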
-
- Jul 11, 2016
-
-
Russell Spitzer authored
## What changes were proposed in this pull request? In order to make it clear which filters are fully handled by the underlying data source, we mark them with an *. This gives users a clear visual cue that the filter is treated differently by Catalyst than filters which are just presented to the underlying DataSource. Examples from the FilteredScanSuite; in this example `c IN (...)` is handled by the source, `b < ...` is not.
### Before
```
//SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
   +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
### After
```
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
   +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
## How was this patch tested? Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested. Post 1.6.1: tested by modifying the FilteredScanSuite to run explains. Author: Russell Spitzer <Russell.Spitzer@gmail.com> Closes #11317 from RussellSpitzer/SPARK-12639-Star.
-
Sameer Agarwal authored
## What changes were proposed in this pull request? This patch fixes a variable namespace collision bug in pmod and partitionBy ## How was this patch tested? Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR. Author: Sameer Agarwal <sameer@databricks.com> Closes #14144 from sameeragarwal/codegen-bug.
-
Tathagata Das authored
## What changes were proposed in this pull request? An incorrect list of files was being allocated to a batch. This caused a file to be read multiple times across multiple batches. ## How was this patch tested? Added unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #14143 from tdas/SPARK-16430-1.
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Display `No physical plan. Waiting for data.` instead of `N/A` for StreamingQuery.explain when no data arrives because `N/A` doesn't provide meaningful information. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14100 from zsxwing/SPARK-16433.
-
Xin Ren authored
## What changes were proposed in this pull request? Some minor changes for documentation page "Spark Streaming + Kinesis Integration". Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets. ## How was this patch tested? Tested manually, on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #14097 from keypointt/kinesisDoc.
-
James Thomas authored
## What changes were proposed in this pull request? A structured streaming example with event time windowing. ## How was this patch tested? Run locally Author: James Thomas <jamesjoethomas@gmail.com> Closes #13957 from jjthomas/current.
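A minimal sketch in the spirit of the new example (source, host, and port are illustrative), showing event-time windowing on a streaming word count:
```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().master("local[*]").appName("windowed-wordcount").getOrCreate()
import spark.implicits._

// Socket source with an attached ingestion timestamp used here as the event time.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Count words per 10-minute window, sliding every 5 minutes.
val windowedCounts = lines.as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
  .count()
```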
-
Marcelo Vanzin authored
Some Hadoop classes needed by the Hive metastore client jars are not present in Spark's packaging (for example, "org/apache/hadoop/mapred/MRVersion"). So if the parent class loader fails to find a class, try to load it from the isolated class loader, in case it's available there. Tested by setting spark.sql.hive.metastore.jars to local paths with Hive/Hadoop libraries and verifying that Spark can talk to the metastore. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #14020 from vanzin/SPARK-16349.
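An illustrative configuration under which this code path matters (the jar paths are placeholders):
```scala
import org.apache.spark.sql.SparkSession

// Point the Hive metastore client at locally provided Hive/Hadoop jars; the
// classloader fallback lets classes missing from Spark's packaging be found there.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "/opt/hive/lib/*:/opt/hadoop/client/*")
  .getOrCreate()
```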
-
Felix Cheung authored
## What changes were proposed in this pull request? From SPARK-16140/PR #13921 - the issue is that we left the write.ml doc empty; screenshots of the problem and the intended fix are attached to the PR. I didn't realize there was already a JIRA on this. mengxr yanboliang ## How was this patch tested? Checked the generated doc. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13993 from felixcheung/rmllibdoc.
-
Yanbo Liang authored
## What changes were proposed in this pull request? * Update the SparkR ML section to make it consistent with the SparkR API docs. * Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them in different algorithms/models in the generated HTML page. ## How was this patch tested? Only a docs update; manually checked the generated docs. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14011 from yanboliang/r-user-guide-update.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? Temporary tables are used frequently, but `spark.catalog.listColumns` does not support those tables. This PR makes `SessionCatalog` support temporary table column listing.
**Before**
```scala
scala> spark.range(10).createOrReplaceTempView("t1")

scala> spark.catalog.listTables().collect()
res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`])

scala> spark.catalog.listColumns("t1").collect()
org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.;
```
**After**
```
scala> spark.catalog.listColumns("t1").collect()
res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false'])
```
## How was this patch tested? Pass the Jenkins tests including a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14114 from dongjoon-hyun/SPARK-16458.
-
Reynold Xin authored
## What changes were proposed in this pull request? After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #14130 from rxin/SPARK-16477.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This PR prevents dropping the current database to avoid errors like the following.
```scala
scala> sql("create database delete_db")

scala> sql("use delete_db")

scala> sql("drop database delete_db")

scala> sql("create table t as select 1")
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found;
```
## How was this patch tested? Pass the Jenkins tests including an updated testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14115 from dongjoon-hyun/SPARK-16459.
-
Xin Ren authored
https://issues.apache.org/jira/browse/SPARK-16381 ## What changes were proposed in this pull request? Update the SQL examples and programming guide for the R language binding. Here I just followed the example in https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction and created a separate R file to store all the example code. ## How was this patch tested? Manual test on my local machine; screenshot attached. Author: Xin Ren <iamshrek@126.com> Closes #14082 from keypointt/SPARK-16381.
-
gatorsmile authored
#### What changes were proposed in this pull request? **Issue 1:** When a query contains LIMIT/TABLESAMPLE 0, the statistics could be zero. Results are correct, but it could cause a huge performance regression. For example,
```Scala
Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
  .createOrReplaceTempView("test")
val df1 = spark.table("test")
val df2 = spark.table("test").limit(0)
val df = df1.join(df2, Seq("k"), "left")
```
The statistics of both `df` and `df2` are zero. The statistics values should never be zero; otherwise `sizeInBytes` of `BinaryNode` will also be zero (product of children). This PR increases the value to `1` when the number of rows is equal to 0. **Issue 2:** When a query contains a negative LIMIT/TABLESAMPLE, we should issue an exception. Negative values could break the implementation assumptions of multiple parts, for example statistics calculation. Below are example queries.
```SQL
SELECT * FROM testData TABLESAMPLE (-1 rows)
SELECT * FROM testData LIMIT -1
```
This PR issues an appropriate exception in this case. **Issue 3:** Spark SQL follows the restriction of the LIMIT clause in Hive. The argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind of numeric expression involving operators, casts, and function return values. You cannot refer to a column or use a subquery. Currently, we do not detect whether the expression in the LIMIT clause is foldable or not. If it is non-foldable, we might issue a strange error message. For example,
```SQL
SELECT * FROM testData LIMIT rand() > 0.2
```
Then a misleading error message is issued, like
```
assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
   +- LocalLimit (_nondeterministic#202 > 0.2)
      +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
         +- LogicalRDD [key#11, value#12]

java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
   +- LocalLimit (_nondeterministic#202 > 0.2)
      +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
         +- LogicalRDD [key#11, value#12]
```
This PR detects it and then issues a meaningful error message. #### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #14034 from gatorsmile/limit.
-
petermaxlee authored
## What changes were proposed in this pull request? This patch implements all the remaining xpath functions that Hive supports and that are not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath. ## How was this patch tested? Added unit tests and end-to-end tests. Author: petermaxlee <petermaxlee@gmail.com> Closes #13991 from petermaxlee/SPARK-16318.
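Hedged usage examples (assuming a SparkSession named `spark`; the expected results in comments follow Hive's xpath UDF semantics):
```scala
// Numeric extraction, string extraction, and a node-list query.
spark.sql("SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show()  // 3
spark.sql("SELECT xpath_string('<a><b>bb</b><c>cc</c></a>', 'a/c')").show()  // cc
spark.sql("SELECT xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()')").show()  // [b1, b2]
```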
-
Reynold Xin authored
## What changes were proposed in this pull request? It is currently fairly difficult to have proper mima excludes when we cut a version branch. I'm proposing a small change to take the exclude list out of the exclude function, and put it in a variable so we can easily union excludes. After this change, we can bump pom.xml version to 2.1.0-SNAPSHOT, without bumping the diff base version. Note that I also deleted all the exclude rules for version 1.x, to cut down the size of the file. ## How was this patch tested? N/A - this is a build infra change. Author: Reynold Xin <rxin@databricks.com> Closes #14128 from rxin/SPARK-16476.
-
- Jul 10, 2016
-
-
Kazuaki Ishizaki authored
## What changes were proposed in this pull request? This PR updates the version of the Janino compiler from 2.7.8 to 3.0.0. This version fixes [a Janino issue](https://github.com/janino-compiler/janino/issues/1) and thereby fixes [an issue](https://issues.apache.org/jira/browse/SPARK-15467) that throws a Java exception in Spark. ## How was this patch tested? Manually tested using a program in [the JIRA entry](https://issues.apache.org/jira/browse/SPARK-15467). Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #14127 from kiszk/SPARK-15467.
-
- Jul 09, 2016
-
-
gatorsmile authored
[SPARK-16401][SQL] Data Source API: Enable Extending RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider #### What changes were proposed in this pull request? When users try to implement a data source API by extending only `RelationProvider` and `CreatableRelationProvider`, they hit an error when resolving the relation.
```Scala
spark.read
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .load()
  .write
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .save()
```
The error they hit is like
```
org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
```
Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](https://github.com/gatorsmile/spark/blob/dd644f8117e889cebd6caca58702a7c7e3d88bef/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L429) already returns a BaseRelation. We should not assign the schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`. #### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Closes #14075 from gatorsmile/dataSource.
-
- Jul 08, 2016
-
-
Michael Gummelt authored
## What changes were proposed in this pull request? Documentation changes to indicate that fine-grained mode is now deprecated. No code changes were made, and all fine-grained mode instructions were left in place. We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle? One major version?) Blocked on https://github.com/apache/spark/pull/14059 ## How was this patch tested? Viewed in Github Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14078 from mgummelt/deprecate-fine-grained.
-
Eric Liang authored
## What changes were proposed in this pull request? It's possible to also change the callers to not pass in empty chunks, but it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc JoshRosen ## How was this patch tested? Unit tests, also checked that the original reproduction case in https://github.com/apache/spark/pull/11748#issuecomment-230760283 is resolved. Author: Eric Liang <ekl@databricks.com> Closes #14099 from ericl/spark-16432.
-