- May 06, 2018
- May 05, 2018
skeirik2 authored
Fixed BlockManager, BlockInfoManager, and MemoryStore by correctly referencing the data structures shared between the three.
- Apr 01, 2018
skeirik2 authored
- Mar 01, 2018
KaiXinXiaoLei authored
## What changes were proposed in this pull request?

I ran the SQL query `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is small, with a single row, and the `catalog_sales` table is big, with 10 billion rows. The task hangs, and I found many null values of `cs_order_number` in the `catalog_sales` table. I think the null values should be removed in the logical plan.

>== Optimized Logical Plan ==
>Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
>:- Project cs_order_number#1
>:  +- Filter isnotnull(cs_order_number#1)
>:     +- MetastoreRelation 100t, ls
>+- Project cs_order_number#22
>   +- MetastoreRelation 100t, catalog_sales

Now, with this patch, the plan becomes:

>== Optimized Logical Plan ==
>Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
>:- Project cs_order_number#1
>:  +- Filter isnotnull(cs_order_number#1)
>:     +- MetastoreRelation 100t, ls
>+- Project cs_order_number#22
>   +- **Filter isnotnull(cs_order_number#22)**
>      +- MetastoreRelation 100t, catalog_sales

Author: KaiXinXiaoLei <584620569@qq.com>
Author: hanghang <584620569@qq.com>

Closes #20670 from KaiXinXiaoLei/Spark-23405.
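A quick way to see the effect described above is to inspect the optimized plan of such a query with `explain(true)`. The sketch below is a minimal, self-contained reproduction with illustrative data, app, and table names (not the reporter's 10-billion-row tables):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; app name, data, and table names are illustrative.
val spark = SparkSession.builder().appName("semi-join-null-filter").master("local[*]").getOrCreate()
import spark.implicits._

// Use nullable join keys so the optimizer has a reason to insert IsNotNull filters.
Seq(Some(1L), None).map(Tuple1(_)).toDF("cs_order_number").createOrReplaceTempView("ls")
Seq(Some(1L), Some(2L), None).map(Tuple1(_)).toDF("cs_order_number").createOrReplaceTempView("catalog_sales")

val df = spark.sql(
  """SELECT ls.cs_order_number
    |FROM ls LEFT SEMI JOIN catalog_sales cs
    |ON ls.cs_order_number = cs.cs_order_number""".stripMargin)

// With this patch, the optimized plan should show a Filter isnotnull(...) under both
// children of the LeftSemi join, not only under the left side.
df.explain(true)
```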
Yuming Wang authored
## What changes were proposed in this pull request?

This is based on https://github.com/apache/spark/pull/20668 for supporting Hive 2.2 and Hive 2.3 metastore.

When we merge the PR, we should give the major credit to wangyum.

## How was this patch tested?

Added the test cases.

Author: Yuming Wang <yumwang@ebay.com>
Author: gatorsmile <gatorsmile@gmail.com>

Closes #20671 from gatorsmile/pr-20668.
liuxian authored
[SPARK-23389][CORE] When the shuffle dependency specifies aggregation and `dependency.mapSideCombine = false`, we should be able to use serialized sorting.

## What changes were proposed in this pull request?

When the shuffle dependency specifies aggregation and `dependency.mapSideCombine = false`, there is no need for aggregation and sorting on the map side, so we should be able to use serialized sorting.

## How was this patch tested?

Existing unit test.

Author: liuxian <liu.xian3@zte.com.cn>

Closes #20576 from 10110346/mapsidecombine.
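Conceptually, the eligibility check for the serialized (unsafe) shuffle path looks roughly like the sketch below. This is a simplified illustration, not Spark's actual `SortShuffleManager` code; the point of the change is that the aggregation test only needs to rule out map-side combine rather than any aggregator at all:

```scala
// Simplified, illustrative predicate; field names and the partition ceiling are assumptions.
case class ShuffleDepInfo(
    serializerSupportsRelocation: Boolean, // serializer can reorder serialized records safely
    hasAggregator: Boolean,                // an aggregator is specified on the dependency
    mapSideCombine: Boolean,               // records are combined on the map side before shuffling
    numPartitions: Int)

val maxSerializedModePartitions = 1 << 24  // assumed ceiling for serialized-mode partition count

// Before: any aggregator disabled the serialized path.
def canUseSerializedShuffleBefore(d: ShuffleDepInfo): Boolean =
  d.serializerSupportsRelocation && !d.hasAggregator &&
    d.numPartitions <= maxSerializedModePartitions

// After: only map-side combine disables it, because without map-side combine the map
// side neither aggregates nor sorts records, so serialized sorting is safe.
def canUseSerializedShuffleAfter(d: ShuffleDepInfo): Boolean =
  d.serializerSupportsRelocation && !d.mapSideCombine &&
    d.numPartitions <= maxSerializedModePartitions
```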
- Feb 28, 2018
Xingbo Jiang authored
## What changes were proposed in this pull request?

Inside `OptimizeMetadataOnlyQuery.getPartitionAttrs`, avoid using `zip` to generate the attribute map. Also includes other minor updates of comments and formatting.

## How was this patch tested?

Existing test cases.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #20693 from jiangxb1987/SPARK-23523.
Juliusz Sompolski authored
## What changes were proposed in this pull request?

A few places in `spark-sql` were using `sc.hadoopConfiguration` directly. They should be using `sessionState.newHadoopConf()` to blend in configs that were set through `SQLConf`.

Also, for better UX, for these configs blended in from `SQLConf`, we should consider removing the `spark.hadoop` prefix, so that the settings are recognized whether or not they were specified by the user.

## How was this patch tested?

Tested that AlterTableRecoverPartitions now correctly recognizes settings that are passed in to the FileSystem through SQLConf.

Author: Juliusz Sompolski <julek@databricks.com>

Closes #20679 from juliuszsompolski/SPARK-23514.
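The "blend in" step can be pictured as copying the context-level Hadoop configuration and overlaying the entries set through `SQLConf`, dropping the `spark.hadoop` prefix where present. The sketch below is a simplified illustration of that idea, not the actual `sessionState.newHadoopConf()` implementation:

```scala
import org.apache.hadoop.conf.Configuration

// Simplified illustration; not the real SessionState.newHadoopConf().
def blendedHadoopConf(
    base: Configuration,                   // e.g. the SparkContext-level hadoopConfiguration
    sqlConfEntries: Map[String, String]    // settings made through SQLConf (SET k=v, spark.conf.set)
  ): Configuration = {
  val conf = new Configuration(base)       // start from the context-level Hadoop config
  sqlConfEntries.foreach { case (k, v) =>
    // Drop the spark.hadoop. prefix (if any) so the setting reaches the FileSystem
    // whether or not the user specified it with the prefix.
    conf.set(k.stripPrefix("spark.hadoop."), v)
  }
  conf
}
```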
hyukjinkwon authored
[SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the trace from Java side by Py4JJavaError

## What changes were proposed in this pull request?

This PR proposes for `pyspark.util._exception_message` to produce the trace from the Java side by `Py4JJavaError`. Currently, in Python 2, it uses the `message` attribute, which `Py4JJavaError` didn't happen to have:

```python
>>> from pyspark.util import _exception_message
>>> try:
...     sc._jvm.java.lang.String(None)
... except Exception as e:
...     pass
...
>>> e.message
''
```

Seems we should use `str` instead for now: https://github.com/bartdag/py4j/blob/aa6c53b59027925a426eb09b58c453de02c21b7c/py4j-python/src/py4j/protocol.py#L412 but this doesn't address the problem with non-ascii strings from the Java side - https://github.com/bartdag/py4j/issues/306

So, we could directly call `__str__()`:

```python
>>> e.__str__()
u'An error occurred while calling None.java.lang.String.\n: java.lang.NullPointerException\n\tat java.lang.String.<init>(String.java:588)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:422)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:238)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:745)\n'
```

which doesn't type coerce unicodes to `str` in Python 2. This can actually be a problem:

```python
from pyspark.sql.functions import udf
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(1).select(udf(lambda x: [[]])()).toPandas()
```

**Before**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
    raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError:
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```

**After**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
    raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError: An error occurred while calling o47.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/pyspark/worker.py", line 245, in main
    process()
  File "/.../spark/python/pyspark/worker.py", line 240, in process
...
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```

## How was this patch tested?

Manually tested and unit tests were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20680 from HyukjinKwon/SPARK-23517.
zhoukang authored
… cause oom

## What changes were proposed in this pull request?

`blockManagerIdCache` in `BlockManagerId` will not remove old values, which may cause OOM:

`val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()`

Whenever we apply for a new `BlockManagerId`, it is put into this map. This patch uses a Guava cache for `blockManagerIdCache` instead. A heap dump is shown in [SPARK-23508](https://issues.apache.org/jira/browse/SPARK-23508).

## How was this patch tested?

Existing tests.

Author: zhoukang <zhoukang199191@gmail.com>

Closes #20667 from caneGuy/zhoukang/fix-history.
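A bounded Guava cache behaves like the sketch below: look-ups still return a canonical instance, but old entries are evicted instead of accumulating forever. The key type and the size limit here are illustrative, not necessarily what the patch itself uses:

```scala
import com.google.common.cache.CacheBuilder

// Sketch of replacing an unbounded ConcurrentHashMap-based interning map with a
// size-bounded Guava cache; types and the 10000 limit are illustrative.
object IdCache {
  private val cache = CacheBuilder.newBuilder()
    .maximumSize(10000)                 // entries beyond this size are evicted instead of accumulating
    .build[String, String]()

  // Return the canonical cached instance for `id`, inserting it if absent.
  def getCached(id: String): String = {
    val existing = cache.asMap().putIfAbsent(id, id)
    if (existing != null) existing else id
  }
}
```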
- Feb 27, 2018
Liang-Chi Hsieh authored
## What changes were proposed in this pull request?

Clarify JSON and CSV reader behavior in the documentation. JSON doesn't support partial results for corrupted records. CSV only supports partial results for records with more or fewer tokens.

## How was this patch tested?

Pass existing tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #20666 from viirya/SPARK-23448-2.
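As a concrete illustration of the documented behavior, the hedged sketch below reads malformed input in `PERMISSIVE` mode with an explicit corrupt-record column; file paths, schema, and column names are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch; paths, schema, and column names are assumptions for the example.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// CSV: a row with too many or too few tokens can still yield a partial result in
// PERMISSIVE mode, with the raw line kept in the corrupt-record column.
val csv = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema("a INT, b INT, _corrupt_record STRING")
  .csv("/tmp/example.csv")

// JSON: a corrupted record yields no partial result; only the corrupt-record column
// (and nulls for the data columns) is populated for that row.
val json = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema("a INT, b INT, _corrupt_record STRING")
  .json("/tmp/example.json")
```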
Bruce Robbins authored
[SPARK-23417][PYTHON] Fix the build instructions supplied by exception messages in python streaming tests

## What changes were proposed in this pull request?

Fix the build instructions supplied by exception messages in python streaming tests. I also added `-DskipTests` to the maven instructions to avoid the 170 minutes of scala tests that occurs each time one wants to add a jar to the assembly directory.

## How was this patch tested?

- clone branch
- run `build/sbt package`
- run `python/run-tests --modules "pyspark-streaming"`, expect error message
- follow instructions in error message, i.e., run `build/sbt assembly/package streaming-kafka-0-8-assembly/assembly`
- rerun python tests, expect error message
- follow instructions in error message, i.e., run `build/sbt -Pflume assembly/package streaming-flume-assembly/assembly`
- rerun python tests, see success
- repeated all of the above for the mvn version of the process

Author: Bruce Robbins <bersprockets@gmail.com>

Closes #20638 from bersprockets/SPARK-23417_propa.
Marco Gaido authored
As suggested in #20651, the code is very redundant in `AllStagesPage` and modifying it is copy-and-paste work. We should avoid such a pattern, which is error prone, and have a cleaner solution which avoids code redundancy.

Existing UTs.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #20663 from mgaido91/SPARK-23475_followup.
Imran Rashid authored
The ExecutorAllocationManager should not adjust the target number of executors when killing idle executors, as it has already adjusted the target number down based on the task backlog. The name `replace` was misleading with DynamicAllocation on, as the target number of executors is changed outside of the call to `killExecutors`, so I adjusted that name. Also separated out the logic of `countFailures` as you don't always want that tied to `replace`.

While I was there I made two changes that weren't directly related to this:

1) Fixed `countFailures` in a couple cases where it was getting an incorrect value since it used to be tied to `replace`, e.g. when killing executors on a blacklisted node.
2) Hard error if you call `sc.killExecutors` with dynamic allocation on, since that's another way the ExecutorAllocationManager and the CoarseGrainedSchedulerBackend would get out of sync.

Added a unit test case which verifies that the calls to ExecutorAllocationClient do not adjust the number of executors.

Author: Imran Rashid <irashid@cloudera.com>

Closes #20604 from squito/SPARK-23365.
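The bookkeeping problem can be pictured with the toy sketch below; it is not Spark's actual `ExecutorAllocationManager` or `CoarseGrainedSchedulerBackend` code, just an illustration of why the kill call must not lower the executor target a second time:

```scala
// Toy illustration only; class, method, and parameter names are made up for the example.
class ToyBackend(var targetNumExecutors: Int) {
  def killExecutors(ids: Seq[String], adjustTargetNumExecutors: Boolean): Unit = {
    if (adjustTargetNumExecutors) targetNumExecutors -= ids.size
    // ... ask the cluster manager to actually kill `ids` ...
  }
}

class ToyAllocationManager(backend: ToyBackend) {
  // The manager has already computed a lower target from the shrinking task backlog.
  def onBacklogShrunk(newTarget: Int, idleExecutors: Seq[String]): Unit = {
    backend.targetNumExecutors = newTarget
    // Pass adjustTargetNumExecutors = false: the kill must not lower the target again,
    // otherwise the backend ends up below what the manager believes it requested.
    backend.killExecutors(idleExecutors, adjustTargetNumExecutors = false)
  }
}
```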
gatorsmile authored
## What changes were proposed in this pull request?

```Scala
val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
  .write.json(tablePath.getCanonicalPath)
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
df.show()
```

It generates a wrong result.

```
[c,e,a]
```

We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order in the original leaf node. This PR is to fix it.

## How was this patch tested?

Added a test case.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20684 from gatorsmile/optimizeMetadataOnly.
cody koeninger authored
## What changes were proposed in this pull request?

Add a configuration `spark.streaming.kafka.allowNonConsecutiveOffsets` to allow streaming jobs to proceed on compacted topics (or other situations involving gaps between offsets in the log).

## How was this patch tested?

Added new unit test. justinrmiller has been testing this branch in production for a few weeks.

Author: cody koeninger <cody@koeninger.org>

Closes #20572 from koeninger/SPARK-17147.
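For a job reading a compacted topic, the new flag would be set on the streaming application's configuration, roughly as in the sketch below; the app name, batch interval, and the rest of the Kafka wiring are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch; app name and batch interval are illustrative.
val conf = new SparkConf()
  .setAppName("compacted-topic-consumer")
  // Allow gaps between offsets (e.g. from log compaction) instead of failing the batch.
  .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// ... create the Kafka direct stream on ssc and start the job as usual ...
```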
Kazuaki Ishizaki authored
## What changes were proposed in this pull request?

This PR avoids version conflicts of `commons-net` by upgrading commons-net from 2.2 to 3.1. We are seeing the following message during the build using sbt.

```
[warn] Found version conflict(s) in library dependencies; some are suspected to be binary incompatible:
...
[warn]  * commons-net:commons-net:3.1 is selected over 2.2
[warn]      +- org.apache.hadoop:hadoop-common:2.6.5              (depends on 3.1)
[warn]      +- org.apache.spark:spark-core_2.11:2.4.0-SNAPSHOT    (depends on 2.2)
[warn]
```

[Here](https://commons.apache.org/proper/commons-net/changes-report.html) is a release history.

[Here](https://commons.apache.org/proper/commons-net/migration.html) is a migration guide from 2.x to 3.0.

## How was this patch tested?

Existing tests.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #20672 from kiszk/SPARK-23509.
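For downstream builds that hit the same kind of conflict, one way to force a single resolved version in sbt is an explicit override; the fragment below is a generic illustration for a user's `build.sbt`, not part of Spark's own build:

```scala
// build.sbt fragment (illustrative): pin commons-net to the newer candidate so that
// only one version is resolved and the binary-incompatibility warning goes away.
dependencyOverrides += "commons-net" % "commons-net" % "3.1"
```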
Juliusz Sompolski authored
## What changes were proposed in this pull request?

Refactor ColumnStat to be more flexible.

* Split `ColumnStat` and `CatalogColumnStat` just like `CatalogStatistics` is split from `Statistics`. This detaches how the statistics are stored from how they are processed in the query plan. `CatalogColumnStat` keeps `min` and `max` as `String`, making it not depend on dataType information.
* For `CatalogColumnStat`, parse column names from property names in the metastore (`KEY_VERSION` property), not from the metastore schema. This means that `CatalogColumnStat`s can be created for columns even if the schema itself is not stored in the metastore.
* Make all fields optional. `min`, `max` and `histogram` for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate.

The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.

## How was this patch tested?

Refactored existing tests to work with the refactored `ColumnStat` and `CatalogColumnStat`. New tests added in `StatisticsSuite` checking that backwards / forwards compatibility is not broken.

Author: Juliusz Sompolski <julek@databricks.com>

Closes #20624 from juliuszsompolski/SPARK-23445.
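Conceptually, the storage-side representation after this refactor can be pictured as in the sketch below; it is a simplified stand-in for the internal classes, with field names, types, and property keys chosen for illustration:

```scala
// Simplified stand-in for the internal catalog-side column statistics; all details illustrative.
// min/max are kept as strings so the class does not depend on the column's DataType, and every
// field is optional so partially computed stats can still be persisted.
case class SimpleCatalogColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None,
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None,
    histogram: Option[String] = None) {

  // Serialize to metastore table properties keyed by column name (e.g. "<prefix>.<col>.min"),
  // so stats can be read back even when the schema itself is not stored in the metastore.
  def toProperties(colName: String, prefix: String): Map[String, String] = {
    Seq(
      "distinctCount" -> distinctCount.map(_.toString),
      "min" -> min,
      "max" -> max,
      "nullCount" -> nullCount.map(_.toString),
      "avgLen" -> avgLen.map(_.toString),
      "maxLen" -> maxLen.map(_.toString),
      "histogram" -> histogram
    ).collect { case (k, Some(v)) => s"$prefix.$colName.$k" -> v }.toMap
  }
}
```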