  1. Jan 18, 2016
    • [SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume integration doc · a973f483
      Shixiong Zhu authored
      This PR adds instructions for getting the flume assembly jar for Python users to the flume integration page, similar to the Kafka doc.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10746 from zsxwing/flume-doc.
    • [SPARK-12882][SQL] simplify bucket tests and add more comments · 40419022
      Wenchen Fan authored
      Right now the bucket tests are kind of hard to understand. This PR simplifies them and adds more comments.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10813 from cloud-fan/bucket-comment.
    • [SPARK-12841][SQL] fix cast in filter · 4f11e3f2
      Wenchen Fan authored
      In SPARK-10743 we wrap cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like `filter`, the `UnresolvedAlias` can't be resolved, and we don't actually need a better alias there. This PR moves the cast-wrapping logic to `Column.named` so that we only do it when we need an alias name.
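      For illustration, a minimal sketch of the affected pattern (the DataFrame `df` and its `age` column are assumed here, not taken from the PR):

      ```
      // Before this fix, the cast below was wrapped in an UnresolvedAlias that
      // could not be resolved inside filter(), even though no alias is needed.
      val adults = df.filter(df("age").cast("int") >= 18)
      ```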
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10781 from cloud-fan/bug.
    • [SPARK-12855][SQL] Remove parser dialect developer API · 38c3c0e3
      Reynold Xin authored
      This pull request removes the public developer parser API for external parsers. Given that everything a parser depends on (e.g. logical plans and expressions) is internal and not stable, external parsers will break with every release of Spark. It is a bad idea to create the illusion that Spark actually supports pluggable parsers. In addition, this also reduces incentives for 3rd party projects to contribute parser improvements back to Spark.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10801 from rxin/SPARK-12855.
    • [SPARK-10985][CORE] Avoid passing evicted blocks throughout BlockManager · b8cb548a
      Josh Rosen authored
      This patch refactors portions of the BlockManager and CacheManager in order to avoid having to pass `evictedBlocks` lists throughout the code. It appears that these lists were only consumed by `TaskContext.taskMetrics`, so the new code now directly updates the metrics from the lower-level BlockManager methods.
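      A minimal sketch of the shape of this refactor; all names below are illustrative stand-ins, not Spark's actual internals:

      ```
      import scala.collection.mutable

      object Before {
        // Callers had to thread an evictedBlocks buffer through every level and
        // later copy its contents into TaskContext.taskMetrics themselves.
        def dropFromMemory(blockId: String, evictedBlocks: mutable.Buffer[String]): Unit = {
          evictedBlocks += blockId
        }
      }

      object After {
        private val updatedBlocks = mutable.Buffer.empty[String] // stands in for task metrics

        // The low-level method records the eviction in the metrics directly.
        def dropFromMemory(blockId: String): Unit = {
          updatedBlocks += blockId
        }
      }
      ```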
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10776 from JoshRosen/SPARK-10985.
    • [SPARK-12884] Move classes to their own files for readability · 302bb569
      Andrew Or authored
      This is a small step in implementing SPARK-10620, which migrates `TaskMetrics` to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just moves classes to their own files to avoid having single monolithic ones that contain 10 different classes.
      
      Parent PR: #10717
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10810 from andrewor14/move-things.
    • [SPARK-12346][ML] Missing attribute names in GLM for vector-type features · 5e492e9d
      Eric Liang authored
      Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names.
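      A sketch of the scenario (the DataFrame `df` and its numeric columns `x1` and `x2` are assumed): assembling inputs that carry no ML attribute names now yields generated placeholder names in the output attributes, so `summary()` on a downstream GLM model no longer fails.

      ```
      import org.apache.spark.ml.feature.VectorAssembler

      val assembler = new VectorAssembler()
        .setInputCols(Array("x1", "x2"))
        .setOutputCol("features")
      val assembled = assembler.transform(df) // output attrs get made-up names if inputs lack them
      ```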
      
      cc mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #10323 from ericl/spark-12346.
    • [SPARK-12873][SQL] Add more comments in HiveTypeCoercion for type widening · 44fcf992
      Reynold Xin authored
      I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo.
      
      I also simplified WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code, and the new code. The logic is super simple.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10802 from rxin/SPARK-12873.
    • [SPARK-12558][FOLLOW-UP] AnalysisException when multiple functions applied in GROUP BY clause · db9a8605
      Dilip Biswal authored
      Addresses the comments from Yin.
      https://github.com/apache/spark/pull/10520
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #10758 from dilipbiswal/spark-12558-followup.
    • [SPARK-10264][DOCUMENTATION] Added @Since to ml.recommendation · 233d6cee
      Tommy YU authored
      I created a new PR since the original PR had not been updated for a long time. Please help to review.
      
      srowen
      
      Author: Tommy YU <tummyyu@163.com>
      
      Closes #10756 from Wenpei/add_since_to_recomm.
  2. Jan 17, 2016
  3. Jan 16, 2016
    • [SPARK-12796] [SQL] Whole stage codegen · 3c0d2365
      Davies Liu authored
      This is the initial work for whole-stage codegen. It supports Projection/Filter/Range; we will continue working on this to support more physical operators.

      A micro benchmark shows that a query with range, filter and projection could be 3X faster than before.

      It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen will be inserted into it. For example, the following plan
      ```
      Limit 10
      +- Project [(id#5L + 1) AS (id + 1)#6L]
         +- Filter ((id#5L & 1) = 1)
            +- Range 0, 1, 4, 10, [id#5L]
      ```
      will be translated into
      ```
      Limit 10
      +- WholeStageCodegen
            +- Project [(id#1L + 1) AS (id + 1)#2L]
               +- Filter ((id#1L & 1) = 1)
                  +- Range 0, 1, 4, 10, [id#1L]
      ```
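      A quick way to reproduce a plan of this shape (a sketch assuming a `sqlContext`; the exact plan output varies):

      ```
      val df = sqlContext.range(0, 10)
        .filter("(id & 1) = 1")
        .selectExpr("id + 1")
      df.explain() // shows the WholeStageCodegen node wrapping Project/Filter/Range
      ```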
      
      Here is the call graph to generate Java source for A and B (A supports codegen, but B does not):
      
      ```
        *   WholeStageCodegen       Plan A               FakeInput        Plan B
        * =========================================================================
        *
        * -> execute()
        *     |
        *  doExecute() -------->   produce()
        *                             |
        *                          doProduce()  -------> produce()
        *                                                   |
        *                                                doProduce() ---> execute()
        *                                                   |
        *                                                consume()
        *                          doConsume()  ------------|
        *                             |
        *  doConsume()  <-----    consume()
      ```
      
      A SparkPlan that supports codegen needs to implement doProduce() and doConsume():
      
      ```
      def doProduce(ctx: CodegenContext): (RDD[InternalRow], String)
      def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10735 from davies/whole2.
    • [SPARK-12722][DOCS] Fixed typo in Pipeline example · 86972fa5
      Jeff Lam authored
      http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
      ```
      val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
      ```
      should be
      ```
      val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
      ```
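      For context, a sketch of why `PipelineModel.load` is the right call (`pipeline`, `training`, and the path are assumed): fitting a `Pipeline` yields a `PipelineModel`, and that is the type that must be loaded back.

      ```
      import org.apache.spark.ml.{Pipeline, PipelineModel}

      val model = pipeline.fit(training) // fit() returns a PipelineModel
      model.save("/tmp/spark-logistic-regression-model")
      val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
      ```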
      cc: jkbradley
      
      Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu>
      
      Closes #10769 from Agent007/SPARK-12722.
    • [SPARK-12856] [SQL] speed up hashCode of unsafe array · 2f7d0b68
      Wenchen Fan authored
      Previously we iterated over the bytes to calculate the hashCode, but now we have `Murmur3_x86_32.hashUnsafeBytes`, which doesn't require the bytes to be word-aligned, so we should use that instead.

      A simple benchmark shows it's about 3x faster. Benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala
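      The call this PR switches to, as a sketch (the seed value here is arbitrary): `hashUnsafeBytes` hashes a byte region without requiring word-aligned input.

      ```
      import org.apache.spark.unsafe.Platform
      import org.apache.spark.unsafe.hash.Murmur3_x86_32

      val bytes = Array[Byte](1, 2, 3, 4, 5)
      val hash = Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, 42)
      ```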
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10784 from cloud-fan/array-hashcode.
  4. Jan 15, 2016
    • [SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions) into code generated classes · 242efb75
      Davies Liu authored
      
      This is a refactor to support codegen for aggregation and broadcast join.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10777 from davies/rename2.
    • [SPARK-12644][SQL] Update parquet reader to be vectorized. · 9039333c
      Nong Li authored
      This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch.
      There are a few particulars in the Parquet encodings that make this much more efficient. In
      particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are
      also very suited for this.
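      An illustration of why RLE suits batch decoding (the method below is hypothetical, not the reader's API): a run of N identical values becomes one bulk write instead of N per-value reads.

      ```
      import java.util.Arrays

      // Fill the output batch with a whole run at once.
      def decodeRleRun(value: Int, runLength: Int, out: Array[Int], offset: Int): Unit =
        Arrays.fill(out, offset, offset + runLength, value)
      ```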
      
      This is a work in progress and does not affect the current execution. In subsequent patches, we will
      support more encodings and types before enabling this.
      
      Simple benchmarks indicate this can decode single ints more than 3x faster.
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong <nongli@gmail.com>
      
      Closes #10593 from nongli/spark-12644.
    • [SPARK-12649][SQL] support reading bucketed table · 3b5ccb12
      Wenchen Fan authored
      This PR adds the support to read bucketed tables, and correctly populate `outputPartitioning`, so that we can avoid shuffle for some cases.
      
      TODO(follow-up PRs):
      
      * bucket pruning
      * avoid shuffle for bucketed table joins when using any super-set of the bucketing key
        (we should re-visit this after https://issues.apache.org/jira/browse/SPARK-12704 is fixed)
      * recognize hive bucketed table
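      For context, a sketch of producing a bucketed table on the write side (table and column names are illustrative; `bucketBy` comes from the related write-side work):

      ```
      df.write
        .bucketBy(8, "user_id")
        .sortBy("user_id")
        .saveAsTable("bucketed_users")
      // Reads of bucketed_users can now populate outputPartitioning and avoid
      // a shuffle when the join/aggregation key matches the bucketing key.
      ```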
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10604 from cloud-fan/bucket-read.
    • [SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile · 8dbbf3e7
      Josh Rosen authored
      This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
      
      /cc rxin srowen
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10775 from JoshRosen/add-hadoop-2.7-profile.
    • [SPARK-12833][HOT-FIX] Reset the locale after we set it. · f6ddbb36
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10778 from yhuai/resetLocale.
    • [SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA · 5f843781
      Yanbo Liang authored
      Add PySpark missing methods and params for ml.feature:
      * ```RegexTokenizer``` should support setting ```toLowercase```.
      * ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```.
      * ```PCAModel``` should support output ```pc```.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9908 from yanboliang/spark-11925.
    • [SPARK-12575][SQL] Grammar parity with existing SQL parser · 7cd7f220
      Herman van Hovell authored
      In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.
      
      Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
      - The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR **removes** this keyword.
      - The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this.
      - Hive supports a combination of a charset name and a string literal; for instance the expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive only allows charset names to start with an underscore. This is quite annoying in Spark because as soon as you use a tuple, names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
      - Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.
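      The big decimal literal mentioned above, as a sketch (assuming a `sqlContext`):

      ```
      val df = sqlContext.sql("SELECT 81923801.42BD AS x") // x is a decimal, not a Double
      ```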
      
      cc rxin viirya marmbrus yhuai cloud-fan
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10745 from hvanhovell/SPARK-12575-2.
    • [SQL][MINOR] BoundReference does not need to be a NamedExpression · 3f1c58d6
      Wenchen Fan authored
      We made it a `NamedExpression` to work around some hacky cases a long time ago, and now it seems safe to remove it.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10765 from cloud-fan/minor.
    • [SPARK-12716][WEB UI] Add a TOTALS row to the Executors Web UI · 61c45876
      Alex Bozarth authored
      Added a Totals table to the top of the page to display the totals of each applicable column in the executors table.
      
      Old Description:
      ~~Created a TOTALS row containing the totals of each column in the executors UI. By default the TOTALS row appears at the top of the table. When a column is sorted the TOTALS row will always sort to either the top or bottom of the table.~~
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #10668 from ajbozarth/spark12716.
    • Fix typo · 0bb73554
      Julien Baley authored
      disvoered => discovered
      
      Author: Julien Baley <julien.baley@gmail.com>
      
      Closes #10773 from julienbaley/patch-1.
    • [SPARK-12833][HOT-FIX] Fix scala 2.11 compilation. · 513266c0
      Yin Huai authored
      Seems https://github.com/apache/spark/commit/5f83c6991c95616ecbc2878f8860c69b2826f56c breaks scala 2.11 compilation.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10774 from yhuai/fixScala211Compile.
    • [SPARK-12667] Remove block manager's internal "external block store" API · ad1503f9
      Reynold Xin authored
      This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems.
      
      There are some other things to remove also, as pointed out by JoshRosen. We will do those as follow-up pull requests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10752 from rxin/remove-offheap.
    • [SPARK-12833][SQL] Initial import of spark-csv · 5f83c699
      Hossein authored
      CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL.
      
      This is a first PR to bring the functionality to Spark 2.0 master. We will complete the items outlined in the design document (see JIRA attachment) in follow-up pull requests.
      
      Author: Hossein <hossein@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10766 from rxin/csv.
    • [MINOR] [SQL] GeneratedExpressionCode -> ExprCode · c5e7076d
      Davies Liu authored
      GeneratedExpressionCode is too long
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10767 from davies/renaming.
    • [SPARK-11031][SPARKR] Method str() on a DataFrame · ba4a6419
      Oscar D. Lara Yejas authored
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
      Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #9613 from olarayej/SPARK-11031.
    • [SPARK-2930] clarify docs on using webhdfs with spark.yarn.access.namenodes · 96fb894d
      Tom Graves authored
      
      Author: Tom Graves <tgraves@yahoo-inc.com>
      
      Closes #10699 from tgravescs/SPARK-2930.
    • [SPARK-12655][GRAPHX] GraphX does not unpersist RDDs · d0a5c32b
      Jason Lee authored
      Some VertexRDD and EdgeRDD are created during the intermediate step of g.connectedComponents() but unnecessarily left cached after the method is done. The fix is to unpersist these RDDs once they are no longer in use.
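      One way to observe the leak this fixes (a sketch; `sc` and the edge file path are assumed):

      ```
      import org.apache.spark.graphx.GraphLoader

      val graph = GraphLoader.edgeListFile(sc, "data/edges.txt")
      val cc = graph.connectedComponents()
      // Before the fix, intermediate VertexRDDs/EdgeRDDs created inside
      // connectedComponents() stayed in sc.getPersistentRDDs after it returned.
      println(sc.getPersistentRDDs.size)
      ```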
      
      A test case is added to confirm the fix for the reported bug.
      
      Author: Jason Lee <cjlee@us.ibm.com>
      
      Closes #10713 from jasoncl/SPARK-12655.
    • [SPARK-12830] Java style: disallow trailing whitespaces. · fe7246fe
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10764 from rxin/SPARK-12830.
  5. Jan 14, 2016
    • [SPARK-12829] Turn Java style checker on · 591c88c9
      Reynold Xin authored
      It was previously turned off because there was a problem with a pull request. We should turn it on now.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10763 from rxin/SPARK-12829.
    • [SPARK-12708][UI] Sorting task error in Stages Page in yarn mode. · 32cca933
      Koyo Yoshida authored
      If the sort column contains a slash (e.g. "Executor ID / Host") in yarn mode, sorting fails with the following message.
      
      ![spark-12708](https://cloud.githubusercontent.com/assets/6679275/12193320/80814f8c-b62a-11e5-9914-7bf3907029df.png)
      
      It's similar to SPARK-4313.
      
      Author: root <root@R520T1.(none)>
      Author: Koyo Yoshida <koyo0615@gmail.com>
      
      Closes #10663 from yoshidakuy/SPARK-12708.
    • [SPARK-12813][SQL] Eliminate serialization for back to back operations · cc7af86a
      Michael Armbrust authored
      The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations. In order to achieve this I also made the following simplifications:

       - Operators no longer hold encoders; instead they have only the expressions that they need. The benefits here are twofold: the expressions are visible to transformations, so they go through the normal resolution/binding process, and now that they are visible we can change them on a case-by-case basis.
       - Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the compiler was an unnecessary complication. We still leverage the scala compiler in the companion factory when constructing a new operator, but after this the types are discarded.
      
      Deferred to a follow up PR:
       - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error though in the case of mismatches for an `as` operation.
       - Eliminate serializations in more cases by adding more cases to `EliminateSerialization`
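      The pattern this targets, as a hedged sketch (`ds` is an assumed `Dataset[Int]` with encoders in scope): after this change, rows are no longer serialized to the external type and back between the two operators.

      ```
      val result = ds
        .mapPartitions(it => it.map(_ + 1))
        .mapPartitions(it => it.filter(_ % 2 == 0))
      ```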
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #10747 from marmbrus/encoderExpressions.
    • [SPARK-12174] Speed up BlockManagerSuite getRemoteBytes() test · 25782981
      Josh Rosen authored
      This patch significantly speeds up the BlockManagerSuite's "SPARK-9591: getRemoteBytes from another location when Exception throw" test, reducing the test time from 45s to ~250ms. The key change was to set `spark.shuffle.io.maxRetries` to 0 (the code previously set `spark.network.timeout` to `2s`, but this didn't make a difference because the slowdown was not due to this timeout).
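      The relevant setting, as a sketch: with fetch retries disabled, the test's deliberately failing remote fetch fails fast instead of retrying for most of those 45 seconds.

      ```
      import org.apache.spark.SparkConf

      val conf = new SparkConf().set("spark.shuffle.io.maxRetries", "0")
      ```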
      
      Along the way, I also cleaned up the way that we handle SparkConf in BlockManagerSuite: previously, each test would mutate a shared SparkConf instance, while now each test gets a fresh SparkConf.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10759 from JoshRosen/SPARK-12174.
    • [SPARK-12821][BUILD] Style checker should run when configuration files for style are modified, even if no source files are · bcc7373f
      Kousuke Saruta authored
      
      When running the `run-tests` script, style checkers currently run only when source files are modified; they should also run when configuration files related to style are modified.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #10754 from sarutak/SPARK-12821.
    • [SPARK-12771][SQL] Simplify CaseWhen code generation · 902667fd
      Reynold Xin authored
      The generated code for CaseWhen uses a control variable "got" to make sure we do not evaluate more branches once a branch is true. Changing that to generate a simple "if / else" chain would be slightly more efficient.
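      A hand-written Scala analogue of the simpler generated shape (the real output is generated Java; this is only a sketch of the control flow for CASE WHEN c1 THEN v1 WHEN c2 THEN v2 ELSE v3 END):

      ```
      def caseWhen(c1: Boolean, c2: Boolean, v1: Int, v2: Int, v3: Int): Int =
        if (c1) v1 else if (c2) v2 else v3
      ```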
      
      This closes #10737.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10755 from rxin/SPARK-12771.