  1. Oct 09, 2014
    • ravipesala's avatar
      [SPARK-3834][SQL] Backticks not correctly handled in subquery aliases · 6f98902a
      ravipesala authored
      Queries like SELECT a.key FROM (SELECT key FROM src) \`a\` do not work because backticks in subquery aliases are not handled properly. This PR fixes that.
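
      A hedged reproduction sketch (assuming a registered `src` table and the usual `sqlContext` entry point):

      ```
      // Before this fix, the backticked subquery alias below failed to parse;
      // afterwards it resolves like an unquoted alias. (Illustrative usage only.)
      val result = sqlContext.sql("SELECT a.key FROM (SELECT key FROM src) `a`")
      result.collect().foreach(println)
      ```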
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2737 from ravipesala/SPARK-3834 and squashes the following commits:
      
      0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases
      6f98902a
    • Cheng Lian's avatar
      [SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK · 421382d0
      Cheng Lian authored
      Uses `MEMORY_AND_DISK` as the default storage level for in-memory table caching. Due to the in-memory columnar representation, recomputing the partitions of an in-memory cached table can be very expensive.
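
      A small usage sketch (the table name `logs` is illustrative); with this change the plain cache call behaves like an explicit `MEMORY_AND_DISK` request:

      ```
      // Both lines below now cache with MEMORY_AND_DISK semantics, so evicted
      // partitions are read back from disk instead of being recomputed.
      sqlContext.cacheTable("logs")
      sqlContext.sql("CACHE TABLE logs")
      ```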
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2686 from liancheng/spark-3824 and squashes the following commits:
      
      35d2ed0 [Cheng Lian] Removes extra space
      1ab7967 [Cheng Lian] Reduces test data size to fit DiskStore.getBytes()
      ba565f0 [Cheng Lian] Makes CachedBatch serializable
      07f0204 [Cheng Lian] Sets in-memory table default storage level to MEMORY_AND_DISK
      421382d0
    • Cheng Lian's avatar
      [SPARK-3654][SQL] Unifies SQL and HiveQL parsers · edf02da3
      Cheng Lian authored
      This PR is a follow up of #2590, and tries to introduce a top level SQL parser entry point for all SQL dialects supported by Spark SQL.
      
      A top level parser `SparkSQLParser` is introduced to handle the syntaxes that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and `SET`, etc.). For any syntax this parser doesn't recognize directly, it falls back to a specified function that tries to parse arbitrary input into a `LogicalPlan`. This function is typically another parser combinator like `SqlParser`. DDL syntaxes introduced in #2475 can be moved here.
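
      A minimal sketch of the fallback pattern described above (class and case names are illustrative, not Spark's actual internals):

      ```
      // Handle the dialect-independent commands here; delegate everything else.
      sealed trait LogicalPlan
      case class CacheTable(table: String) extends LogicalPlan
      case class SetCommand(kv: String) extends LogicalPlan
      case class DialectPlan(sql: String) extends LogicalPlan

      class TopLevelParser(fallback: String => LogicalPlan) {
        def parse(input: String): LogicalPlan = {
          val sql = input.trim
          val upper = sql.toUpperCase
          if (upper.startsWith("CACHE TABLE ")) CacheTable(sql.substring("CACHE TABLE ".length))
          else if (upper.startsWith("SET ")) SetCommand(sql.substring("SET ".length))
          else fallback(sql) // e.g. SqlParser or the Hive QL parser
        }
      }

      // Usage: plug in any dialect parser as the fallback.
      val parser = new TopLevelParser(sql => DialectPlan(sql))
      ```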
      
      The `ExtendedHiveQlParser` now only handles Hive-specific extensions.
      
      Also took the chance to refactor/reformat `SqlParser` for better readability.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2698 from liancheng/gen-sql-parser and squashes the following commits:
      
      ceada76 [Cheng Lian] Minor styling fixes
      9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in the parser
      bb2ab12 [Cheng Lian] SET property value can be empty string
      ce8860b [Cheng Lian] Passes test suites
      e86968e [Cheng Lian] Removes debugging code
      8bcace5 [Cheng Lian] Replaces digit.+ with rep1(digit) (Scala style checking doesn't like it)
      d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers
      edf02da3
    • Sean Owen's avatar
      SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively, Utils.createTempDir · 363baaca
      Sean Owen authored
      I noticed a few issues with how temp directories are created and deleted:
      
      *Minor*
      
      * Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
      * Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
      * _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_
      
      *Bit Less Minor*
      
      * `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete the remaining files and subdirectories. I've observed this leave temp dirs behind. I suggest changing it to continue in the face of an exception and to throw one of the possibly several exceptions that occurred at the end (see the sketch after this list).
      * `Utils.createTempDir()` will add a JVM shutdown hook every time the method is called, even if the subdir already falls under another registered dir, since that check only happens inside the hook. However, `Utils` already manages a set of all dirs to delete on shutdown, called `shutdownDeletePaths`. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.
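
      A minimal sketch (simplified, assumed shape) of the "delete everything you can, then throw one of the collected exceptions" behavior suggested in the first bullet:

      ```
      import java.io.{File, IOException}

      def deleteRecursively(file: File): Unit = {
        var savedException: Option[IOException] = None
        if (file.isDirectory) {
          Option(file.listFiles()).getOrElse(Array.empty[File]).foreach { child =>
            try deleteRecursively(child)
            catch { case e: IOException => savedException = Some(e) } // keep going
          }
        }
        if (!file.delete() && file.exists()) {
          savedException = Some(new IOException(s"Failed to delete: $file"))
        }
        savedException.foreach(e => throw e) // surface one failure at the end
      }
      ```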
      
      I noticed a few other things that might be changed but wanted to ask first:
      
      * Shouldn't the set of dirs to delete be `File`, not just `String` paths?
      * `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Shouldn't this logic live together, and outside `Utils`? It's more specific to Tachyon, and looks slightly odd to import in such a generic place.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2670 from srowen/SPARK-3811 and squashes the following commits:
      
      071ae60 [Sean Owen] Update per @vanzin's review
      da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
      3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
      363baaca
    • Michael Armbrust's avatar
      [SPARK-3798][SQL] Store the output of a generator in a val · 2837bf85
      Michael Armbrust authored
      This prevents the generator's output from changing during serialization, which led to corrupted results.
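
      An illustrative reminder (not Catalyst's actual code) of why the `val` matters: a `def` is re-evaluated on every access, including after deserialization, while a `val` is computed once and serialized with the object:

      ```
      import scala.util.Random

      class GeneratorLike extends Serializable {
        def outputDef: Seq[Int] = Seq(Random.nextInt()) // recomputed on every access: unstable
        val outputVal: Seq[Int] = Seq(Random.nextInt()) // computed once, serialized with the object
      }
      ```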
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2656 from marmbrus/generateBug and squashes the following commits:
      
      efa32eb [Michael Armbrust] Store the output of a generator in a val. This prevents it from changing during serialization.
      2837bf85
    • Josh Rosen's avatar
      [SPARK-3772] Allow `ipython` to be used by Pyspark workers; IPython support improvements: · 4e9b551a
      Josh Rosen authored
      This pull request addresses a few issues related to PySpark's IPython support:
      
      - Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
      - Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
      - Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
      - Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
      - Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).
      
      There are more details in a block comment in `bin/pyspark`.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:
      
      7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
      c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
      4e9b551a
    • ravipesala's avatar
      [SPARK-3813][SQL] Support "case when" conditional functions in Spark SQL. · ac302052
      ravipesala authored
      "case when" conditional function is already supported in Spark SQL but there is no support in SqlParser. So added parser support to it.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2678 from ravipesala/SPARK-3813 and squashes the following commits:
      
      70c75a7 [ravipesala] Fixed styles
      713ea84 [ravipesala] Updated as per admin comments
      709684f [ravipesala] Changed parser to support case when function.
      ac302052
    • Nathan Howell's avatar
      [SPARK-3858][SQL] Pass the generator alias into logical plan node · bc3b6cb0
      Nathan Howell authored
      The alias parameter is being ignored, which makes it more difficult to specify a qualifier for Generator expressions.
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #2721 from NathanHowell/SPARK-3858 and squashes the following commits:
      
      8aa0f43 [Nathan Howell] [SPARK-3858][SQL] Pass the generator alias into logical plan node
      bc3b6cb0
    • Daoyuan Wang's avatar
      [SPARK-3412][SQL]add missing row api · 0c0e09f5
      Daoyuan Wang authored
      chenghao-intel assigned this to me; see PR #2284 for the previous discussion
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2529 from adrian-wang/rowapi and squashes the following commits:
      
      c6594b2 [Daoyuan Wang] using boxed
      7b7e6e3 [Daoyuan Wang] update pattern match
      7a39456 [Daoyuan Wang] rename file and refresh getAs[T]
      4c18c29 [Daoyuan Wang] remove setAs[T] and null judge
      1614493 [Daoyuan Wang] add missing row api
      0c0e09f5
    • Yin Huai's avatar
      [SPARK-3339][SQL] Support for skipping json lines that fail to parse · 1c7f0ab3
      Yin Huai authored
      This PR aims to provide a way to skip/query corrupt JSON records. To do so, we introduce an internal column to hold corrupt records (the default name is `_corrupt_record`; it can be changed by setting `spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we put the corrupt record, in its unparsed form, into the internal column. Users can skip/query this column through SQL.
      
      * To query those corrupt records
      ```
      -- For Hive parser
      SELECT `_corrupt_record`
      FROM jsonTable
      WHERE `_corrupt_record` IS NOT NULL
      -- For our SQL parser
      SELECT _corrupt_record
      FROM jsonTable
      WHERE _corrupt_record IS NOT NULL
      ```
      * To skip corrupt records and query regular records
      ```
      -- For Hive parser
      SELECT field1, field2
      FROM jsonTable
      WHERE `_corrupt_record` IS NULL
      -- For our SQL parser
      SELECT field1, field2
      FROM jsonTable
      WHERE _corrupt_record IS NULL
      ```
      
      Generally, it is not recommended to change the name of the internal column. If the name has to be changed to avoid possible name conflicts, you can use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql("SET spark.sql.columnNameOfCorruptRecord=<new column name>")`.
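
      A hedged end-to-end usage sketch (assuming the usual `sc`/`sqlContext` entry points):

      ```
      val lines = sc.parallelize(Seq("""{"a": 1}""", """{"a": """)) // second line is corrupt
      val table = sqlContext.jsonRDD(lines)
      table.registerTempTable("jsonTable")
      val corrupt = sqlContext.sql(
        "SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
      val clean = sqlContext.sql("SELECT a FROM jsonTable WHERE _corrupt_record IS NULL")
      ```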
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #2680 from yhuai/corruptJsonRecord and squashes the following commits:
      
      4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
      309616a [Yin Huai] Change the default name of corrupt record to "_corrupt_record".
      b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
      9375ae9 [Yin Huai] Set the column name of corrupt json record back to the default one after the unit test.
      ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed strings.
      1c7f0ab3
    • Patrick Wendell's avatar
      Revert "[SPARK-2805] Upgrade to akka 2.3.4" · 1faa1135
      Patrick Wendell authored
      This reverts commit b9df8af6.
      1faa1135
    • Mike Timper's avatar
      [SPARK-3853][SQL] JSON Schema support for Timestamp fields · ec4d40e4
      Mike Timper authored
      In JsonRDD.scala, adds a 'case TimestampType' to the enforceCorrectType function, along with a toTimestamp function.
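
      A sketch of such a conversion (assumed shape, not the exact JsonRDD code):

      ```
      import java.sql.Timestamp

      // Numbers are treated as epoch milliseconds, strings as JDBC timestamps.
      def toTimestamp(value: Any): Timestamp = value match {
        case n: java.lang.Number => new Timestamp(n.longValue())
        case s: String           => Timestamp.valueOf(s) // "yyyy-MM-dd HH:mm:ss[.f...]"
        case other               => sys.error(s"Cannot convert $other to Timestamp")
      }
      ```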
      
      Author: Mike Timper <mike@aurorafeint.com>
      
      Closes #2720 from mtimper/master and squashes the following commits:
      
      9386ab8 [Mike Timper] Fix and tests for SPARK-3853
      ec4d40e4
    • cocoatomo's avatar
      [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log · e7edb723
      cocoatomo authored
      The ./python/run-tests script displays messages about which test it is currently running on stdout, but does not write them to unit-tests.log.
      This makes it harder to recognize which test programs were executed and which test failed.
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2724 from cocoatomo/issues/3868-display-testing-module-name and squashes the following commits:
      
      c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log
      e7edb723
    • scwf's avatar
      [SPARK-3806][SQL] Minor fix for CliSuite · 2c885134
      scwf authored
      This fixes two issues in CliSuite.
      1. CliSuite throws an IndexOutOfBoundsException:
      Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
      	at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
      	at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
      	at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
      	at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
      	at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
      	at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
      	at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
      	at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
      	at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
      	at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
      	at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
      	at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
      	at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
      	at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)
      
      Actually, it is multi-threading that leads to this problem.
      
      2. Use `line.startsWith` instead of `line.contains` to assert the expected answer. This fixes a tiny bug in CliSuite: for the test case "Simple commands" one expected answer is "5", and with `contains`, log output such as "14/10/06 11:54:36 INFO CliDriver: Time taken: 1.078 seconds" or "14/10/06 11:54:36 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%" would also contain a "5" and make the assertion pass.
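
      A quick illustration of why `contains` is too permissive here:

      ```
      val logLine = "14/10/06 11:54:36 INFO CliDriver: Time taken: 1.078 seconds"
      logLine.contains("5")   // true: "5" appears inside the timestamp
      logLine.startsWith("5") // false: only an actual answer line "5" matches
      ```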
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2666 from scwf/clisuite and squashes the following commits:
      
      11430db [scwf] fix-clisuite
      2c885134
    • Yash Datta's avatar
      [SPARK-3711][SQL] Optimize where in clause filter queries · 752e90f1
      Yash Datta authored
      The In case class is replaced by an InSet class when all the filter values are literals. InSet uses a HashSet instead of a Sequence, giving a significant performance improvement (previously the sequence used a worst-case linear match via the exists method, since the filter list was assumed to contain arbitrary expressions). The maximum improvement should be visible when a small percentage of a large dataset matches the filter list.
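
      A minimal sketch (simplified shapes, not Catalyst's actual expression classes) of the rewrite; here the list already holds plain values, standing in for literals:

      ```
      case class In(value: Any, list: Seq[Any]) {
        def eval: Boolean = list.exists(_ == value) // worst-case linear scan per row
      }
      case class InSet(value: Any, hset: Set[Any]) {
        def eval: Boolean = hset.contains(value) // constant-time lookup per row
      }

      // The real rule fires only when every element of the filter list is a literal.
      def optimize(in: In): InSet = InSet(in.value, in.list.toSet)
      ```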
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #2561 from saucam/branch-1.1 and squashes the following commits:
      
      4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order             2. Fix optimization condition             3. Add tests for null in filter list             4. Add test case that optimization is not triggered in case of attributes in filter list
      afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in ExpressionEvaluationSuite             2. Add class OptimizedInSuite on the lines of ConstantFoldingSuite, for the optimized In clause
      0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by constantFolding
      bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move optimization of In clause to Optimizer.scala by adding a rule. Add appropriate comments
      430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of negative values as well
      bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries
      752e90f1
    • Vida Ha's avatar
      [SPARK-3752][SQL]: Add tests for different UDF's · b77a02f4
      Vida Ha authored
      Author: Vida Ha <vida@databricks.com>
      
      Closes #2621 from vidaha/vida/SPARK-3752 and squashes the following commits:
      
      d7fdbbc [Vida Ha] Add tests for different UDF's
      b77a02f4
    • zsxwing's avatar
      [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors · 73bf3f2e
      zsxwing authored
      
      This PR made the following changes:
      * Register a callback to `Connection` so that the error will be propagated properly.
      * Add more logs so that the errors won't be swallowed by Executors.
      * Use trySuccess/tryFailure because `Promise` doesn't allow calling success/failure more than once (see the snippet below).
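
      For reference, the `Promise` semantics that motivate the try* variants (standard Scala library behavior):

      ```
      import scala.concurrent.Promise

      val p = Promise[Int]()
      p.trySuccess(42)                        // completes the promise, returns true
      p.tryFailure(new Exception("too late")) // already completed, returns false
      // p.success(1) would throw IllegalStateException at this point
      ```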
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #2593 from zsxwing/SPARK-3741 and squashes the following commits:
      
      1d5aed5 [zsxwing] Fix naming
      0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
      764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors
      73bf3f2e
    • GuoQiang Li's avatar
      [Minor] use norm operator after breeze 0.10 upgrade · 1e0aa4de
      GuoQiang Li authored
      cc mengxr
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #2730 from witgo/SPARK-3856 and squashes the following commits:
      
      2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade
      1e0aa4de
    • Qiping Li's avatar
      [SPARK-3158][MLLIB]Avoid 1 extra aggregation for DecisionTree training · 14f222f7
      Qiping Li authored
      Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).
      
      ### Implementation Details
      
      Each node now has an `impurity` field, and `predict` is changed from type `Double` to type `Predict` (which can be used to compute prediction probabilities in the future). When computing the best splits for each node, we also compute impurity and predict for the child nodes, and use them to construct the newly allocated child nodes. So at level L, we have already set impurity and predict for the nodes at level L+1. If level L+1 is the last level, we can therefore avoid the aggregation.

      What's more, the calculation of parent impurity in the top nodes of each tree needs to be treated differently, because we have to compute impurity and predict for them first. In `binsToBestSplit`, if the current node is the top node (level == 0), we calculate impurity and predict first; after finding the best split, the top node's predict and impurity are set to the calculated values. Non-top nodes' impurity and predict are already calculated and don't need to be recomputed. I considered adding an initialization step to set the top nodes' impurity and predict so that all nodes could be treated the same way, but this would require a lot of code duplication (all the code for the seq operation, `BinSeqOp`, would need to be duplicated), so I chose the current approach.
      
       CC mengxr manishamde jkbradley, please help me review this, thanks.
      
      Author: Qiping Li <liqiping1991@gmail.com>
      
      Closes #2708 from chouqin/avoid-agg and squashes the following commits:
      
      8e269ea [Qiping Li] adjust code and comments
      eefeef1 [Qiping Li] adjust comments and check child nodes' impurity
      c41b1b6 [Qiping Li] fix pyspark unit test
      7ad7a71 [Qiping Li] fix unit test
      822c912 [Qiping Li] add comments and unit test
      e41d715 [Qiping Li] fix bug in test suite
      6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree training
      14f222f7
    • nartz's avatar
      add spark.driver.memory to config docs · 13cab5ba
      nartz authored
      It took me a minute to track this down, so I thought it could be useful to have it in the docs.
      
      I'm unsure whether 512mb is the default for spark.driver.memory. Also, there could be a better 'description' value to differentiate it from spark.executor.memory.
      
      Author: nartz <nartzpod@gmail.com>
      Author: Nathan Artz <nathanartz@Nathans-MacBook-Pro.local>
      
      Closes #2410 from nartz/docs/add-spark-driver-memory-to-config-docs and squashes the following commits:
      
      a2f6c62 [nartz] Update configuration.md
      74521b8 [Nathan Artz] add spark.driver.memory to config docs
      13cab5ba
    • Xiangrui Meng's avatar
      [SPARK-3844][UI] Truncate appName in WebUI if it is too long · 86b39294
      Xiangrui Meng authored
      Truncate appName in WebUI if it is too long.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2707 from mengxr/truncate-app-name and squashes the following commits:
      
      87834ce [Xiangrui Meng] move scala import below java
      c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
      86b39294
    • Anand Avati's avatar
      [SPARK-2805] Upgrade to akka 2.3.4 · b9df8af6
      Anand Avati authored
      Upgrade to akka 2.3.4
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1685 from avati/SPARK-1812-akka-2.3 and squashes the following commits:
      
      57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
      2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
      b9df8af6
    • Xiangrui Meng's avatar
      [SPARK-3856][MLLIB] use norm operator after breeze 0.10 upgrade · 9c439d33
      Xiangrui Meng authored
      Got warning msg:
      
      ~~~
      [warn] /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50: method norm in trait NumericOps is deprecated: Use norm(XXX) instead of XXX.norm
      [warn]     var norm = vector.toBreeze.norm(p)
      ~~~
      
      dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2718 from mengxr/SPARK-3856 and squashes the following commits:
      
      4f38169 [Xiangrui Meng] use norm operator
      9c439d33
    • Josh Rosen's avatar
      Fetch from branch v4 in Spark EC2 script. · f706823b
      Josh Rosen authored
      f706823b
  2. Oct 08, 2014
    • Reynold Xin's avatar
      [SPARK-3857] Create joins package for various join operators. · bcb1ae04
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2719 from rxin/sql-join-break and squashes the following commits:
      
      0c0082b [Reynold Xin] Fix line length.
      cbc664c [Reynold Xin] Rename join -> joins package.
      a070d44 [Reynold Xin] Fix line length in HashJoin
      a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.
      bcb1ae04
    • Cheng Lian's avatar
      [SQL] Prevents per row dynamic dispatching and pattern matching when inserting Hive values · 3e4f09d2
      Cheng Lian authored
      Builds all wrappers up front according to object inspector types, to avoid per-row costs.
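
      A minimal sketch (assumed types, not the actual Hive wrapper code) of hoisting the per-row pattern match out of the hot loop:

      ```
      // Resolve each column's wrapper once (per partition / writer), not per row.
      def wrapperFor(typeName: String): Any => Any = typeName match {
        case "string" => v => v.toString
        case "int"    => v => v.asInstanceOf[Number].intValue()
        case _        => identity
      }

      val wrappers: Array[Any => Any] = Array("string", "int").map(wrapperFor)

      def wrapRow(row: Seq[Any]): Seq[Any] =
        row.zip(wrappers).map { case (v, w) => w(v) } // no pattern match per value
      ```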
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2592 from liancheng/hive-value-wrapper and squashes the following commits:
      
      9696559 [Cheng Lian] Passes all tests
      4998666 [Cheng Lian] Prevents per row dynamic dispatching and pattern matching when inserting Hive values
      3e4f09d2
    • Cheng Lian's avatar
      [SPARK-3810][SQL] Makes PreInsertionCasts handle partitions properly · e7033572
      Cheng Lian authored
      Takes partition keys into account when applying the `PreInsertionCasts` rule.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2672 from liancheng/fix-pre-insert-casts and squashes the following commits:
      
      def1a1a [Cheng Lian] Makes PreInsertionCasts handle partitions properly
      e7033572
    • Cheng Hao's avatar
      [SPARK-3707] [SQL] Fix bug of type coercion in DIV · 4ec93195
      Cheng Hao authored
      Calling `BinaryArithmetic.dataType` throws an exception until the expression is resolved, but the type coercion rule `Division` doesn't seem to follow this.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2559 from chenghao-intel/type_coercion and squashes the following commits:
      
      199a85d [Cheng Hao] Simplify the divide rule
      dc55218 [Cheng Hao] fix bug of type coercion in div
      4ec93195
    • Liquan Pei's avatar
      [SQL][Doc] Keep Spark SQL README.md up to date · 00b77917
      Liquan Pei authored
      marmbrus
      Update README.md to be consistent with Spark 1.1
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #2706 from Ishiihara/SparkSQL-readme and squashes the following commits:
      
      33b9d4b [Liquan Pei] keep README.md up to date
      00b77917
    • Cheng Lian's avatar
      [SPARK-3713][SQL] Uses JSON to serialize DataType objects · a42cc08d
      Cheng Lian authored
      This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases.
      
      Since we already write schema information to Parquet metadata in the old style, we have to keep the old `DataType` parser and ensure backward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`.
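
      A hedged round-trip sketch of the JSON form (shown with the modern `org.apache.spark.sql.types` package; at the time of this PR these classes lived under `catalyst.types`):

      ```
      import org.apache.spark.sql.types._

      val schema = StructType(Seq(
        StructField("id", IntegerType, nullable = false),
        StructField("name", StringType, nullable = true)))

      val json = schema.json                 // compact JSON representation
      val restored = DataType.fromJson(json) // parse it back
      assert(restored == schema)
      ```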
      
      JoshRosen davies Please help review PySpark related changes, thanks!
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2563 from liancheng/datatype-to-json and squashes the following commits:
      
      fc92eb3 [Cheng Lian] Reverts debugging code, simplifies primitive type JSON representation
      438c75f [Cheng Lian] Refactors PySpark DataType JSON SerDe per comments
      6b6387b [Cheng Lian] Removes debugging code
      6a3ee3a [Cheng Lian] Addresses per review comments
      dc158b5 [Cheng Lian] Addresses PEP8 issues
      99ab4ee [Cheng Lian] Adds compatibility test case for Parquet type conversion
      a983a6c [Cheng Lian] Adds PySpark support
      f608c6e [Cheng Lian] De/serializes DataType objects from/to JSON
      a42cc08d
    • Kousuke Saruta's avatar
      [SPARK-3831] [SQL] Filter rule Improvement and bool expression optimization. · a85f24ac
      Kousuke Saruta authored
      If we write a filter which is always FALSE, like
      
          SELECT * from person WHERE FALSE;
      
      200 tasks will run. I think 1 task is enough.
      
      And the current optimizer cannot optimize the case where NOT is duplicated, like
      
          SELECT * from person WHERE NOT ( NOT (age > 30));
      
      Filters like the one above should be simplified.
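
      A minimal sketch (illustrative case classes, not Catalyst's actual ones) of the double-negation simplification; the always-FALSE case is handled analogously by replacing the filtered relation with an empty local relation so that no tasks run:

      ```
      sealed trait Expr
      case class Not(child: Expr) extends Expr
      case class Attr(name: String) extends Expr

      def simplifyNot(e: Expr): Expr = e match {
        case Not(Not(child)) => simplifyNot(child) // NOT(NOT(x)) => x
        case Not(child)      => Not(simplifyNot(child))
        case other           => other
      }

      // simplifyNot(Not(Not(Attr("age > 30")))) == Attr("age > 30")
      ```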
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2692 from sarutak/SPARK-3831 and squashes the following commits:
      
      25f3e20 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3831
      23c750c [Kousuke Saruta] Improved unsupported predicate test case
      a11b9f3 [Kousuke Saruta] Modified NOT predicate test case in PartitionBatchPruningSuite
      8ea872b [Kousuke Saruta] Fixed the number of tasks when the data of  LocalRelation is empty.
      a85f24ac
    • Kousuke Saruta's avatar
      [SPARK-3843][Minor] Cleanup scalastyle.txt at the end of running dev/scalastyle · add174aa
      Kousuke Saruta authored
      dev/scalastyle creates a log file 'scalastyle.txt'. It is overwritten on each run but never deleted, even though dev/mima and dev/lint-python delete their log files.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2702 from sarutak/scalastyle-txt-cleanup and squashes the following commits:
      
      d6e238e [Kousuke Saruta] Fixed dev/scalastyle to cleanup scalastyle.txt
      add174aa
    • Joseph K. Bradley's avatar
      [SPARK-3841] [mllib] Pretty-print params for ML examples · b92bd5a2
      Joseph K. Bradley authored
      Provide a parent class for the Params case classes used in many MLlib examples, where the parent class pretty-prints the case class fields:
      Param1Name	Param1Value
      Param2Name	Param2Value
      ...
      Using this class will make it easier to print test settings to logs.
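
      A minimal sketch of such a parent class (assumes Scala 2.13's `productElementNames`; illustrative, not the class added by this PR):

      ```
      trait PrettyParams extends Product {
        override def toString: String =
          productElementNames.zip(productIterator)
            .map { case (name, value) => s"$name\t$value" }
            .mkString("\n")
      }

      case class ExampleParams(input: String = "data.txt", maxDepth: Int = 5) extends PrettyParams

      // println(ExampleParams()) prints:
      // input	data.txt
      // maxDepth	5
      ```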
      
      Also, updated DecisionTreeRunner to print a little more info.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2700 from jkbradley/dtrunner-update and squashes the following commits:
      
      cff873f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      7a08ae4 [Joseph K. Bradley] code review comment updates
      b4d2043 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      d8228a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      0fc9c64 [Joseph K. Bradley] Added abstract TestParams class for mllib example parameters
      12b7798 [Joseph K. Bradley] Added abstract class TestParams for pretty-printing Params values
      5f84f03 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      f7441b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      19eb6fc [Joseph K. Bradley] Updated DecisionTreeRunner to print training time.
      b92bd5a2
    • Patrick Wendell's avatar
      bc441872
    • Kousuke Saruta's avatar
      [SPARK-3848] yarn alpha doesn't build on master · f18dd596
      Kousuke Saruta authored
      The yarn alpha build was broken by #2432, which added an argument to YarnAllocator but not to the yarn/alpha YarnAllocationHandler
      (commit https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88).
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2715 from sarutak/SPARK-3848 and squashes the following commits:
      
      bafb8d1 [Kousuke Saruta] Fixed parameters for the default constructor of alpha/YarnAllocatorHandler.
      f18dd596
    • Marcelo Vanzin's avatar
      [SPARK-3788] [yarn] Fix compareFs to do the right thing for HDFS namespaces. · 7fca8f41
      Marcelo Vanzin authored
      HA and viewfs use namespaces instead of host names, so attempting to
      resolve them will fail. Be smarter about this to avoid doing
      unnecessary work.
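
      A hedged sketch of the idea (not the exact `compareFs` code): compare filesystem URIs structurally instead of resolving host names, which fails for HA/viewfs namespaces:

      ```
      import java.net.URI
      import java.util.Objects

      // Scheme + authority comparison; an HA namespace like "nameservice1" is a
      // logical authority, not a resolvable host, so no DNS lookup is attempted.
      def sameFileSystem(a: URI, b: URI): Boolean =
        Objects.equals(a.getScheme, b.getScheme) &&
        Objects.equals(a.getAuthority, b.getAuthority)

      sameFileSystem(new URI("hdfs://nameservice1/a"), new URI("hdfs://nameservice1/b")) // true
      ```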
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2649 from vanzin/SPARK-3788 and squashes the following commits:
      
      fedbc73 [Marcelo Vanzin] Update comment.
      c938845 [Marcelo Vanzin] Use Objects.equal() to avoid issues with ==.
      9f7b571 [Marcelo Vanzin] [SPARK-3788] [yarn] Fix compareFs to do the right thing for HA, federation.
      7fca8f41
    • Marcelo Vanzin's avatar
      [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2. · 35afdfd6
      Marcelo Vanzin authored
      It seems some dependencies are not declared when pulling the 2.2
      test dependencies, so we need to add them manually for the Yarn
      cluster to come up.
      
      These don't seem to be necessary for 2.3 and beyond, so restrict
      them to the hadoop-2.2 profile.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2682 from vanzin/SPARK-3710 and squashes the following commits:
      
      701d4fb [Marcelo Vanzin] Add comment.
      0540bdf [Marcelo Vanzin] [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2.
      35afdfd6
    • Ahir Reddy's avatar
      [SPARK-3836] [REPL] Spark REPL optionally propagate internal exceptions · c7818434
      Ahir Reddy authored
      Optionally have the REPL throw exceptions generated by interpreted code, instead of swallowing the exception and returning it as text output. This is useful when embedding the REPL; otherwise it's not possible to know when user code threw an exception.
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #2695 from ahirreddy/repl-throw-exceptions and squashes the following commits:
      
      bad25ee [Ahir Reddy] Style Fixes
      f0e5b44 [Ahir Reddy] Fixed style
      0d4413d [Ahir Reddy] propagate exceptions from repl
      c7818434
  3. Oct 07, 2014