- Jun 30, 2015
-
-
zsxwing authored
Sometimes the user may want to show the complete content of cells. Currently `sql("set -v").show()` truncates long cell values, so the user needs to use something like `sql("set -v").collect().foreach(r => r.toSeq.mkString("\t"))` to show the complete content. This PR adds a `pretty` parameter to `show`. If `pretty` is false, `show` won't truncate strings or right-align cells. Author: zsxwing <zsxwing@gmail.com> Closes #6877 from zsxwing/show and squashes the following commits: 22e28e9 [zsxwing] pretty -> truncate e582628 [zsxwing] Add pretty parameter to the show method in R a3cd55b [zsxwing] Fix calling showString in R 923cee4 [zsxwing] Add a "pretty" parameter to show to display long strings
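A minimal sketch of the intended behavior, using the PySpark analog (the PR itself targets the Scala and R APIs, and the commit log shows the parameter was renamed `pretty` -> `truncate`; whether the flag is also exposed in Python is an assumption here):

```python
# assumes a pyspark shell where `sqlContext` is predefined
df = sqlContext.sql("set -v")
df.show()                # default: long cells truncated, values right-aligned
df.show(truncate=False)  # assumed flag: full cell contents, left-aligned
```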
-
Josh Rosen authored
This patch fixes a bug introduced in #7031 which can cause Jenkins to incorrectly report a build with failed Python tests as passing if an error occurred while printing the test failure message. Author: Josh Rosen <joshrosen@databricks.com> Closes #7112 from JoshRosen/python-tests-hotfix and squashes the following commits: c3f2961 [Josh Rosen] Hotfix for bug in Python test failure reporting
-
Yadong Qi authored
[SPARK-8650] [SQL] Use the user-specified app name priority in SparkSQLCLIDriver or HiveThriftServer2. When running `./bin/spark-sql --name query1.sql`, the user-specified app name should take priority over the default. Author: Yadong Qi <qiyadong2010@gmail.com> Closes #7030 from watermen/SPARK-8650 and squashes the following commits: 51b5134 [Yadong Qi] Improve code and add comment. e3d7647 [Yadong Qi] use spark.app.name priority.
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #7109 from rxin/auto-cast and squashes the following commits: a914cc3 [Reynold Xin] [SPARK-8721][SQL] Rename ExpectsInputTypes => AutoCastInputTypes.
-
- Jun 29, 2015
-
-
Patrick Wendell authored
This commit exists to close the following pull requests on Github: Closes #1767 (close requested by 'andrewor14') Closes #6952 (close requested by 'andrewor14') Closes #7051 (close requested by 'andrewor14') Closes #5357 (close requested by 'marmbrus') Closes #5233 (close requested by 'andrewor14') Closes #6930 (close requested by 'JoshRosen') Closes #5502 (close requested by 'andrewor14') Closes #6778 (close requested by 'andrewor14') Closes #7006 (close requested by 'andrewor14')
-
Josh Rosen authored
This commit parallelizes the Python unit test execution, significantly reducing Jenkins build times. Parallelism is now configurable by passing the `-p` or `--parallelism` flags to either `dev/run-tests` or `python/run-tests` (the default parallelism is 4, but I've successfully tested with higher parallelism). To avoid flakiness, I've disabled the Spark Web UI for the Python tests, similar to what we've done for the JVM tests. Author: Josh Rosen <joshrosen@databricks.com> Closes #7031 from JoshRosen/parallelize-python-tests and squashes the following commits: feb3763 [Josh Rosen] Re-enable other tests f87ea81 [Josh Rosen] Only log output from failed tests d4ded73 [Josh Rosen] Logging improvements a2717e1 [Josh Rosen] Make parallelism configurable via dev/run-tests 1bacf1b [Josh Rosen] Merge remote-tracking branch 'origin/master' into parallelize-python-tests 110cd9d [Josh Rosen] Fix universal_newlines for Python 3 cd13db8 [Josh Rosen] Also log python_implementation 9e31127 [Josh Rosen] Log Python --version output for each executable. a2b9094 [Josh Rosen] Bump up parallelism. 5552380 [Josh Rosen] Python 3 fix 866b5b9 [Josh Rosen] Fix lazy logging warnings in Prospector checks 87cb988 [Josh Rosen] Skip MLLib tests for PyPy 8309bfe [Josh Rosen] Temporarily disable parallelism to debug a failure 9129027 [Josh Rosen] Disable Spark UI in Python tests 037b686 [Josh Rosen] Temporarily disable JVM tests so we can test Python speedup in Jenkins. af4cef4 [Josh Rosen] Initial attempt at parallelizing Python test execution
-
Yanbo Liang authored
MLlib Python API consistency check Author: Yanbo Liang <ybliang8@gmail.com> Closes #6856 from yanboliang/spark-7667 and squashes the following commits: 21bae35 [Yanbo Liang] remove duplicate code eb12f95 [Yanbo Liang] fix doc inherit problem 9e7ec3c [Yanbo Liang] address comments e763d32 [Yanbo Liang] MLlib Python API consistency check
-
Steven She authored
Patch to fix a crash on BINARY fields with ENUM original types. Author: Steven She <steven@canopylabs.com> Closes #7048 from stevencanopy/SPARK-8669 and squashes the following commits: 2e72979 [Steven She] [SPARK-8669] [SQL] Fix crash with BINARY (ENUM) fields with Parquet 1.7
-
Burak Yavuz authored
cc yhuai Author: Burak Yavuz <brkyvz@gmail.com> Closes #7100 from brkyvz/ct-flakiness-fix and squashes the following commits: abc299a [Burak Yavuz] change 'to' to until 7e96d7c [Burak Yavuz] ArrayOutOfBoundsException fixed for DataFrameStatSuite.crosstab
-
Feynman Liang authored
Python API for N-gram feature transformer Author: Feynman Liang <fliang@databricks.com> Closes #6960 from feynmanliang/ngram-featurizer-python and squashes the following commits: f9e37c9 [Feynman Liang] Remove debugging code 4dd81f4 [Feynman Liang] Fix typo and doctest 06c79ac [Feynman Liang] Style guide 26c1175 [Feynman Liang] Add python NGram API
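A short usage sketch of the new Python API (the column names and input DataFrame are assumptions for illustration):

```python
from pyspark.ml.feature import NGram

# `word_df` is assumed to have an array-of-strings column named "words"
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngram.transform(word_df).select("ngrams").show()
```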
-
Michael Sannella x268 authored
This is a simple change to add a new environment variable "spark.sparkr.r.command" that specifies the command that SparkR will use when creating an R engine process. If this is not specified, "Rscript" will be used by default. I did not add any documentation, since I couldn't find any place where environment variables (such as "spark.sparkr.use.daemon") are documented. I also did not add a unit test. The only test that would work generally would be one starting SparkR with sparkR.init(sparkEnvir=list(spark.sparkr.r.command="Rscript")), just using the default value. I think that this is a low-risk change. Likely committers: shivaram Author: Michael Sannella x268 <msannell@tibco.com> Closes #6557 from msannell/altR and squashes the following commits: 7eac142 [Michael Sannella x268] add spark.sparkr.r.command config parameter
-
Burak Yavuz authored
This PR also includes re-ordering the order that repositories are used when resolving packages. User provided repositories will be prioritized. cc andrewor14 Author: Burak Yavuz <brkyvz@gmail.com> Closes #7089 from brkyvz/delete-prev-ivy-resolution and squashes the following commits: a21f95a [Burak Yavuz] remove previous ivy resolution when using spark-submit
-
Sean Owen authored
[SPARK-8437] [DOCS] Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles Note that 'dir/*' can be more efficient in some Hadoop FS implementations than 'dir/' Author: Sean Owen <sowen@cloudera.com> Closes #7036 from srowen/SPARK-8437 and squashes the following commits: 0e813ae [Sean Owen] Note that 'dir/*' can be more efficient in some Hadoop FS implementations that 'dir/'
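For example, both of these read the same files, but the wildcard form can avoid an expensive listing path in some Hadoop filesystem implementations (the HDFS path is hypothetical):

```python
# assumes a pyspark shell where `sc` is predefined
without_glob = sc.wholeTextFiles("hdfs:///data/many-small-files/")
with_glob = sc.wholeTextFiles("hdfs:///data/many-small-files/*")  # often faster
```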
-
Yin Huai authored
Hopefully, this suite will not be flaky anymore. Author: Yin Huai <yhuai@databricks.com> Closes #7027 from yhuai/SPARK-8567 and squashes the following commits: c0167e2 [Yin Huai] Add sc.stop().
-
zsxwing authored
As per the unit test log in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35754/

```
15/06/24 23:09:10.210 Thread-3495 INFO ReceiverTracker: Starting 1 receivers
15/06/24 23:09:10.270 Thread-3495 INFO SparkContext: Starting job: apply at Transformer.scala:22
...
15/06/24 23:09:14.259 ForkJoinPool-4-worker-29 INFO StreamingListenerSuiteReceiver: Started receiver and sleeping
15/06/24 23:09:14.270 ForkJoinPool-4-worker-29 INFO StreamingListenerSuiteReceiver: Reporting error and sleeping
```

it needs at least 4 seconds to receive all receiver events on this slow machine, but the `timeout` for `eventually` is only 2 seconds. This PR increases `timeout` to make this test stable. Author: zsxwing <zsxwing@gmail.com> Closes #7017 from zsxwing/SPARK-8634 and squashes the following commits: 719cae4 [zsxwing] Fix flaky test StreamingListenerSuite "receiver info reporting"
-
Wenchen Fan authored
Move date/time-related operations into `DateTimeUtils` and rename some methods to make them clearer. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6980 from cloud-fan/datetime and squashes the following commits: 9373a9d [Wenchen Fan] cleanup DateTimeUtil
-
Yin Huai authored
jira: https://issues.apache.org/jira/browse/SPARK-8710 Author: Yin Huai <yhuai@databricks.com> Closes #7094 from yhuai/SPARK-8710 and squashes the following commits: c854baa [Yin Huai] Change ScalaReflection.mirror from a val to a def.
-
Rosstin authored
[SPARK-8661][ML] In mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments, to make copy-pasting R code simpler. Author: Rosstin <asterazul@gmail.com> Closes #7098 from Rosstin/SPARK-8661 and squashes the following commits: 5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code. bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660 242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala 2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
-
Davies Liu authored
This PR brings arbitrary object support to UnsafeRow (in both the grouping key and the aggregation buffer). Two object pools will be created to hold those non-primitive objects, and their indices are stored in the UnsafeRow. In order to compare grouping keys as bytes, the objects in the key are stored in a unique object pool, to make sure the same objects get the same index (used as the hashCode). For StringType and BinaryType, we still store them as var-length values in the UnsafeRow when initializing, for better performance. But on update, they become objects inside the object pools (leaving some garbage in the buffer). BTW: Will create a JIRA once issues.apache.org is available. cc JoshRosen rxin Author: Davies Liu <davies@databricks.com> Closes #6959 from davies/unsafe_obj and squashes the following commits: 5ce39da [Davies Liu] fix comment 5e797bf [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj 5803d64 [Davies Liu] fix conflict 461d304 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj 2f41c90 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj b04d69c [Davies Liu] address comments 4859b80 [Davies Liu] fix comments f38011c [Davies Liu] add a test for grouping by decimal d2cf7ab [Davies Liu] add more tests for null checking 71983c5 [Davies Liu] add test for timestamp e8a1649 [Davies Liu] reuse buffer for string 39f09ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj 035501e [Davies Liu] fix style 236d6de [Davies Liu] support arbitrary object in UnsafeRow
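A hypothetical illustration (not Spark's actual implementation) of the "unique object pool" idea: equal objects map to one shared index, so the index can stand in for the object when keys are compared as raw bytes:

```python
class UniqueObjectPool(object):
    """Assign each distinct object one stable index; equal objects share it."""

    def __init__(self):
        self._objects = []   # index -> object
        self._indices = {}   # object -> index

    def put(self, obj):
        # reuse the index of an equal object, or append a new entry
        if obj in self._indices:
            return self._indices[obj]
        idx = len(self._objects)
        self._objects.append(obj)
        self._indices[obj] = idx
        return idx

    def get(self, idx):
        return self._objects[idx]

pool = UniqueObjectPool()
assert pool.put((1, "a")) == pool.put((1, "a"))  # equal keys -> same index
```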
-
BenFradet authored
Follow-up of #6902 to be consistent between `Udf` and `UDF`. Author: BenFradet <benjamin.fradet@gmail.com> Closes #6920 from BenFradet/SPARK-8478 and squashes the following commits: c500f29 [BenFradet] renamed a few variables in functions to use UDF 8ab0f2d [BenFradet] renamed idUdf to idUDF in SQLQuerySuite 98696c2 [BenFradet] renamed originalUdfs in TestHive to originalUDFs 7738f74 [BenFradet] modified HiveUDFSuite to use only UDF c52608d [BenFradet] renamed HiveUdfSuite to HiveUDFSuite e51b9ac [BenFradet] renamed ExtractPythonUdfs to ExtractPythonUDFs 8c756f1 [BenFradet] renamed Hive UDF related code 2a1ca76 [BenFradet] renamed pythonUdfs to pythonUDFs 261e6fb [BenFradet] renamed ScalaUdf to ScalaUDF
-
Rosstin authored
[SPARK-8660][ML] Convert JavaDoc-style comments in LogisticRegressionSuite.scala to regular multiline comments, to make copy-pasting R commands easier Converted JavaDoc-style comments in mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala to regular multiline comments, to make copy-pasting R commands easier. Author: Rosstin <asterazul@gmail.com> Closes #7096 from Rosstin/SPARK-8660 and squashes the following commits: 242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala 2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
-
Ai He authored
Method "_load_from_socket" in rdd.py cannot load data from jvm socket when ipv6 is used. The current method only works well with ipv4. New modification should work around both two protocols. Author: Ai He <ai.he@ussuning.com> Author: AiHe <ai.he@ussuning.com> Closes #6338 from AiHe/pyspark-networking-issue and squashes the following commits: d4fc9c4 [Ai He] handle code review 2 e75c5c8 [Ai He] handle code review 5644953 [AiHe] solve python rdd socket connection problem to jvm
-
Ilya Ganelin authored
I've added functionality to build a new StructType incrementally, similar to how we add parameters to a new SparkContext. I've also added tests for this type of creation. Author: Ilya Ganelin <ilya.ganelin@capitalone.com> Closes #6686 from ilganeli/SPARK-8056B and squashes the following commits: 27c1de1 [Ilya Ganelin] Rename 467d836 [Ilya Ganelin] Removed from_string in favor of _parse_Datatype_json_value 5fef5a4 [Ilya Ganelin] Updates for type parsing 4085489 [Ilya Ganelin] Style errors 3670cf5 [Ilya Ganelin] added string to DataType conversion 8109e00 [Ilya Ganelin] Fixed error in tests 41ab686 [Ilya Ganelin] Fixed style errors e7ba7e0 [Ilya Ganelin] Moved some python tests to tests.py. Added cleaner handling of null data type and added test for correctness of input format 15868fa [Ilya Ganelin] Fixed python errors b79b992 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-8056B a3369fc [Ilya Ganelin] Fixing space errors e240040 [Ilya Ganelin] Style bab7823 [Ilya Ganelin] Constructor error 73d4677 [Ilya Ganelin] Style 4ed00d9 [Ilya Ganelin] Fixed default arg 67df57a [Ilya Ganelin] Removed Foo 04cbf0c [Ilya Ganelin] Added comments for single object 0484d7a [Ilya Ganelin] Restored second method 6aeb740 [Ilya Ganelin] Style 689e54d [Ilya Ganelin] Style f497e9e [Ilya Ganelin] Got rid of old code e3c7a88 [Ilya Ganelin] Fixed doctest failure a62ccde [Ilya Ganelin] Style 966ac06 [Ilya Ganelin] style checks dabb7e6 [Ilya Ganelin] Added Python tests a3f4152 [Ilya Ganelin] added python bindings and better comments e6e536c [Ilya Ganelin] Added extra space 7529a2e [Ilya Ganelin] Fixed formatting d388f86 [Ilya Ganelin] Fixed small bug c4e3bf5 [Ilya Ganelin] Reverted to using parse. Updated parse to support long d7634b6 [Ilya Ganelin] Reverted to fromString to properly support types 22c39d5 [Ilya Ganelin] replaced FromString with DataTypeParser.parse. Replaced empty constructor initializing a null to have it instead create a new array to allow appends to it. faca398 [Ilya Ganelin] [SPARK-8056] Replaced default argument usage. Updated usage and code for DataType.fromString 1acf76e [Ilya Ganelin] Scala style e31c674 [Ilya Ganelin] Fixed bug in test 8dc0795 [Ilya Ganelin] Added tests for creation of StructType object with new methods fdf7e9f [Ilya Ganelin] [SPARK-8056] Created add methods to facilitate building new StructType objects.
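A short sketch of the builder-style usage this enables (field names are illustrative):

```python
from pyspark.sql.types import StructType, StringType, IntegerType

schema = (StructType()
          .add("name", StringType(), nullable=True)
          .add("age", IntegerType())
          .add("city", "string"))  # string type names are parsed to DataTypes
```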
-
Josh Rosen authored
This patch excludes `hadoop-client`'s dependency on `mockito-all`. As of #7061, Spark depends on `mockito-core` instead of `mockito-all`, so the dependency from Hadoop was leading to test compilation failures for some of the Hadoop 2 SBT builds. Author: Josh Rosen <joshrosen@databricks.com> Closes #7090 from JoshRosen/SPARK-8709 and squashes the following commits: e190122 [Josh Rosen] [SPARK-8709] Exclude hadoop-client's mockito-all dependency.
-
Davies Liu authored
Avoid unnecessary jobs when inferring the schema from a list. cc yhuai mengxr Author: Davies Liu <davies@databricks.com> Closes #6606 from davies/improve_create and squashes the following commits: a5928bf [Davies Liu] Update MimaExcludes.scala 62da911 [Davies Liu] fix mima bab4d7d [Davies Liu] Merge branch 'improve_create' of github.com:davies/spark into improve_create eee44a8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create 8d9292d [Davies Liu] Update context.py eb24531 [Davies Liu] Update context.py c969997 [Davies Liu] bug fix d5a8ab0 [Davies Liu] fix tests 8c3f10d [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create 6ea5925 [Davies Liu] address comments 6ceaeff [Davies Liu] avoid spark jobs in createDataFrame
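For instance, a call like the following previously could launch a Spark job just to infer the schema of the local list; after this change the inference can happen on the driver (the data shown is illustrative):

```python
# assumes a pyspark shell where `sqlContext` is predefined
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.printSchema()  # schema inferred without touching the cluster
```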
-
Burak Yavuz authored
I specifically randomized the test. What crosstab does is equivalent to a countByKey; therefore, if this test fails again for any reason, we will know that we hit a corner case or something. cc rxin marmbrus Author: Burak Yavuz <brkyvz@gmail.com> Closes #7060 from brkyvz/crosstab-fixes and squashes the following commits: 0a65234 [Burak Yavuz] addressed comments v1 d96da7e [Burak Yavuz] fixed wrong ordering of columns in crosstab
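For reference, a minimal sketch of what `crosstab` computes (column names and data are illustrative):

```python
# assumes a pyspark shell where `sqlContext` is predefined
df = sqlContext.createDataFrame([(1, 1), (1, 2), (2, 1)], ["key", "value"])
# a contingency table: one row per distinct key, one column per distinct value,
# with pair counts in the cells (effectively a countByKey over (key, value))
df.stat.crosstab("key", "value").show()
```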
-
Cheng Hao authored
This is a follow-up of #6404: ScriptTransformation prints error messages directly to stderr, which could be a disaster for the application log. Author: Cheng Hao <hao.cheng@intel.com> Closes #6882 from chenghao-intel/verbose and squashes the following commits: bfedd77 [Cheng Hao] revert the write 76ff46b [Cheng Hao] update the CircularBuffer 692b19e [Cheng Hao] check the process exitValue for ScriptTransform 47e0970 [Cheng Hao] Use the RedirectThread instead 1de771d [Cheng Hao] naming the threads in ScriptTransformation 8536e81 [Cheng Hao] disable the error message redirection for stderr
-
zhichao.li authored
cc chenghao-intel adrian-wang Author: zhichao.li <zhichao.li@intel.com> Closes #6976 from zhichao-li/hex and squashes the following commits: e218d1b [zhichao.li] turn off scalastyle for non-ascii de3f5ea [zhichao.li] non-ascii char cf9c936 [zhichao.li] give separated buffer for each hex method 967ec90 [zhichao.li] Make 'value' as a feild of Hex 3b2fa13 [zhichao.li] tiny fix a647641 [zhichao.li] remove duplicate null check 7cab020 [zhichao.li] tiny refactoring 35ecfe5 [zhichao.li] add function hex
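A short usage sketch of the new `hex` function (DataFrame contents are illustrative):

```python
from pyspark.sql.functions import hex

# assumes a pyspark shell where `sqlContext` is predefined
df = sqlContext.createDataFrame([(17,), (255,)], ["value"])
df.select(hex(df.value)).show()  # yields 11 and FF
```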
-
Kousuke Saruta authored
In DataFrame.scala, there are examples like the following:

```
 * // The following are equivalent:
 * peopleDf.filter($"age" > 15)
 * peopleDf.where($"age" > 15)
 * peopleDf($"age" > 15)
```

But I think the last example doesn't work. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #6977 from sarutak/fix-dataframe-example and squashes the following commits: 46efbd7 [Kousuke Saruta] Removed wrong example
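For contrast, the PySpark equivalents (assuming an existing `peopleDf`), where the bracket form is actually supported:

```python
peopleDf.filter(peopleDf.age > 15)
peopleDf.where(peopleDf.age > 15)
peopleDf[peopleDf.age > 15]  # valid in Python, unlike the removed Scala form
```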
-
Vladimir Vladimirov authored
Use case: we want to log the applicationId (YARN in our case) to request help with troubleshooting from the DevOps team. Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com> Closes #6936 from smartkiwi/master and squashes the following commits: 870338b [Vladimir Vladimirov] this would make doctest to run in python3 0eae619 [Vladimir Vladimirov] Scala doesn't use u'...' for unicode literals 14d77a8 [Vladimir Vladimirov] stop using ELLIPSIS b4ebfc5 [Vladimir Vladimirov] addressed PR feedback - updated docstring 223a32f [Vladimir Vladimirov] fixed test - applicationId is property that returns the string 3221f5a [Vladimir Vladimirov] [SPARK-8528] added documentation for Scala 2cff090 [Vladimir Vladimirov] [SPARK-8528] add applicationId property for SparkContext object in pyspark
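A minimal usage sketch (the ID shown is the YARN-style format and is illustrative):

```python
# assumes a pyspark shell where `sc` is predefined
print(sc.applicationId)  # e.g. 'application_1433865536131_34483' on YARN
```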
-
Tarek Auel authored
Jira: https://issues.apache.org/jira/browse/SPARK-8235 I added support for sha1. If I understood rxin correctly, sha and sha1 should execute the same algorithm, shouldn't they? Please take a close look at the Python part. This is adopted from #6934 Author: Tarek Auel <tarek.auel@gmail.com> Author: Tarek Auel <tarek.auel@googlemail.com> Closes #6963 from tarekauel/SPARK-8235 and squashes the following commits: f064563 [Tarek Auel] change to shaHex 7ce3cdc [Tarek Auel] rely on automatic cast a1251d6 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-8235 68eb043 [Tarek Auel] added docstring be5aff1 [Tarek Auel] improved error message 7336c96 [Tarek Auel] added type check cf23a80 [Tarek Auel] simplified example ebf75ef [Tarek Auel] [SPARK-8301] updated the python documentation. Removed sha in python and scala 6d6ff0d [Tarek Auel] [SPARK-8233] added docstring ea191a9 [Tarek Auel] [SPARK-8233] fixed signatureof python function. Added expected type to misc e3fd7c3 [Tarek Auel] SPARK[8235] added sha to the list of __all__ e5dad4e [Tarek Auel] SPARK[8235] sha / sha1
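A short usage sketch of the new function (the data is illustrative):

```python
from pyspark.sql.functions import sha1

# assumes a pyspark shell where `sqlContext` is predefined
df = sqlContext.createDataFrame([("Alice",)], ["name"])
df.select(sha1(df.name).alias("hash")).show()
```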
-
Marcelo Vanzin authored
Allow HiveContext to connect to metastores of those versions; some new shims had to be added to account for changing internal APIs. A new test was added to exercise the "reset()" path, which now also requires a shim; and the test code was changed to use a directory under the build's target to store ivy dependencies. Without that, at least I consistently run into issues with Ivy messing up (or being confused by) my existing caches. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7026 from vanzin/SPARK-8067 and squashes the following commits: 3e2e67b [Marcelo Vanzin] [SPARK-8066, SPARK-8067] [hive] Add support for Hive 1.0, 1.1 and 1.2.
-
Wenchen Fan authored
Use the same order everywhere: boolean, byte, short, int, date, long, timestamp, float, double, string, binary, decimal. Then we can easily check whether some data types are missing at a glance, and make sure we handle date/timestamp just as we do int/long. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7073 from cloud-fan/fix-date and squashes the following commits: 463044d [Wenchen Fan] fix style 51cd347 [Wenchen Fan] refactor handling of date and timestmap
-
Yu ISHIKAWA authored
[[SPARK-8554] Add the SparkR document files to `.rat-excludes` for `./dev/check-license`](https://issues.apache.org/jira/browse/SPARK-8554) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6947 from yu-iskw/SPARK-8554 and squashes the following commits: 5ca240c [Yu ISHIKAWA] [SPARK-8554] Add the SparkR document files to `.rat-excludes` for `./dev/check-license`
-
Brennon York authored
Hotfix to correct formatting errors of print statements within the dev and jenkins builds. The error looks like:

```
-Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly
```

Author: Brennon York <brennon.york@capitalone.com> Closes #7085 from brennonyork/SPARK-8693 and squashes the following commits: c5575f1 [Brennon York] added commas to end of print statements for proper printing
-
zsxwing authored
When there are massive numbers of tasks, such as `sc.parallelize(1 to 100000, 10000).count()`, the generated JS code in the stage page contains a lot of string concatenations, nearly 40 per task. We can generate the whole string for a task instead of executing string concatenations in the browser. Before this patch, the load time of the page is about 21 seconds; after this patch, it drops to about 17 seconds. One disadvantage is that the generated JS code becomes harder to read. Author: zsxwing <zsxwing@gmail.com> Closes #7082 from zsxwing/js-string and squashes the following commits: b29231d [zsxwing] Avoid massive concating strings in Javascript
-
Reynold Xin authored
[SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple. Author: Reynold Xin <rxin@databricks.com> Closes #7079 from rxin/SPARK-8698 and squashes the following commits: 8513e1c [Reynold Xin] [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
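A minimal sketch of the writer API in question (paths and columns are illustrative; `df` is an assumed existing DataFrame):

```python
# partitioning only happens when explicitly requested:
df.write.partitionBy("year", "month").parquet("/tmp/events")
df.write.parquet("/tmp/events_flat")  # no partitionBy, no implied partitioning
```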
-
Cheolsoo Park authored
I compared the PySpark DataFrameReader/Writer against the Scala ones. The `option` function was missing in both reader and writer, but the rest seems to all match. I added `option` to the reader and writer and updated the `pyspark-sql` test. Author: Cheolsoo Park <cheolsoop@netflix.com> Closes #7078 from piaozhexiu/SPARK-8355 and squashes the following commits: c63d419 [Cheolsoo Park] Fix version 524e0aa [Cheolsoo Park] Add option function to df reader and writer
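A short usage sketch of the added method (the option keys and paths are illustrative, not specific recommendations):

```python
# assumes a pyspark shell where `sqlContext` is predefined
df = (sqlContext.read.format("json")
      .option("samplingRatio", "0.5")  # illustrative key
      .load("/tmp/input.json"))
df.write.format("parquet").option("compression", "snappy").save("/tmp/out")
```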
-