Commits · 680fd87c65e3e7ef223e6a1573c7afe55bff6324 · cs525-sp18-g07 / spark

Nov 01, 2014

Upgrading to roaring 0.4.5 (bug fix release) · 680fd87c

Daniel Lemire authored 10 years ago

I recommend upgrading roaring to 0.4.5 as it fixes a rarely occurring bug in iterators (that would otherwise throw an unwarranted exception). The upgrade should have no other consequence.

Author: Daniel Lemire <lemire@gmail.com>

Closes #3044 from lemire/master and squashes the following commits:

54018c5 [Daniel Lemire] Recommended update to roaring 0.4.5 (bug fix release)
048933e [Daniel Lemire] Merge remote-tracking branch 'upstream/master'
431f3a0 [Daniel Lemire] Recommended bug fix release

680fd87c

Streaming KMeans [MLLIB][SPARK-3254] · 98c556eb

freeman authored 10 years ago

This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.
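
As a rough illustration of the update rule described above, the following Python sketch applies one decayed mini-batch update to a single cluster center. The names (`update_center`, `decay`, and so on) are hypothetical and not taken from the PR; the formula is the general mini-batch KMeans form with the previous weight discounted by the decay factor, and the actual implementation may differ in details.

```python
def update_center(center, weight, batch_mean, batch_count, decay):
    """Blend an existing center with the mean of newly assigned points.

    center      -- current center coordinates (list of floats)
    weight      -- effective number of points seen so far for this center
    batch_mean  -- mean of the points assigned to this center in the new batch
    batch_count -- number of points assigned in the new batch
    decay       -- 1.0 keeps all history, 0.0 forgets everything before this batch
    """
    discounted = weight * decay            # down-weight the past
    total = discounted + batch_count       # new effective weight
    if total == 0:
        return list(center), 0.0           # nothing to update from
    new_center = [
        (c * discounted + x * batch_count) / total
        for c, x in zip(center, batch_mean)
    ]
    return new_center, total
```

With `decay=1.0` every past point counts as much as a new one; with `decay=0.0` the center jumps to the mean of the latest batch.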

The PR includes:
- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors

tdas mengxr rezazadeh

Author: freeman <the.freeman.lab@gmail.com>
Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits:

b2e5b4a [freeman] Fixes to docs / examples
078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
0411bf5 [freeman] Change decay parameterization
9f7aea9 [freeman] Style fixes
374a706 [freeman] Formatting
ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
77dbd3f [freeman] Make initialization check an assertion
9cfc301 [freeman] Make random seed an argument
44050a9 [freeman] Simpler constructor
c7050d5 [freeman] Fix spacing
2899623 [freeman] Use pattern matching for clarity
a4a316b [freeman] Use collect
1472ec5 [freeman] Doc formatting
ea22ec8 [freeman] Fix imports
2086bdc [freeman] Log cluster center updates
ea9877c [freeman] More documentation
9facbe3 [freeman] Bug fix
5db7074 [freeman] Example usage for StreamingKMeans
f33684b [freeman] Add explanation and example to docs
b5b5f8d [freeman] Add better documentation
a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
b93350f [freeman] Streaming KMeans with decay

98c556eb

Oct 31, 2014

[MLLIB] SPARK-1547: Add Gradient Boosting to MLlib · 86021955

Manish Amde authored 10 years ago

Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting, with AdaBoost to follow soon after this PR is accepted. This is based on work done with hirakendu that had been pending because of the decision tree optimizations and random forests work.

Ideally, boosting algorithms should work with any base learner. This will be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default.
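
To make the idea of "pluggable" loss functions concrete, here is a minimal Python sketch of gradient boosting on one-dimensional data. It is not the MLlib code: the class and function names (`SquaredError`, `AbsoluteError`, `fit_stump`, `boost`) are hypothetical, and a one-split regression stump stands in for the decision trees used by the PR.

```python
class SquaredError:
    """Least squares loss: (prediction - label)^2."""
    def gradient(self, prediction, label):
        return 2.0 * (prediction - label)


class AbsoluteError:
    """Least absolute error loss: |prediction - label|."""
    def gradient(self, prediction, label):
        if prediction > label:
            return 1.0
        if prediction < label:
            return -1.0
        return 0.0


def fit_stump(xs, targets):
    """Fit a one-split regression stump minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not right:
            continue  # a split must leave points on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    if best is None:  # all xs identical: predict the mean
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


def boost(xs, ys, loss, num_iterations, learning_rate):
    """Each round fits a stump to the negative gradient of the loss."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(num_iterations):
        residuals = [-loss.gradient(p, y) for p, y in zip(predictions, ys)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)
```

Swapping `SquaredError()` for `AbsoluteError()` changes only the pseudo-residuals that each round fits; the boosting loop itself is untouched, which is the point of keeping the loss pluggable.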

Here is the task list:
- [x] Gradient boosting support
- [x] Pluggable loss functions
- [x] Stochastic gradient boosting support – Re-use the BaggedPoint approach used for RandomForest.
- [x] Binary classification support
- [x] Support configurable checkpointing – This approach will avoid long lineage chains.
- [x] Create classification and regression APIs
- [x] Weighted Ensemble Model -- created a WeightedEnsembleModel class that can be used by ensemble algorithms such as random forests and boosting.
- [x] Unit Tests

Future work:
+ Multi-class classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function.
+ BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator after standard API work is completed.

cc: jkbradley hirakendu mengxr etrain atalwalkar chouqin

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>

Closes #2607 from manishamde/gbt and squashes the following commits:

991c7b5 [Manish Amde] public api
ff2a796 [Manish Amde] addressing comments
b4c1318 [Manish Amde] removing spaces
8476b6b [Manish Amde] fixing line length
0183cb9 [Manish Amde] fixed naming and formatting issues
1c40c33 [Manish Amde] add newline, removed spaces
e33ab61 [Manish Amde] minor comment
eadbf09 [Manish Amde] parameter renaming
035a2ed [Manish Amde] jkbradley formatting suggestions
9f7359d [Manish Amde] simplified gbt logic and added more tests
49ba107 [Manish Amde] merged from master
eff21fe [Manish Amde] Added gradient boosting tests
3fd0528 [Manish Amde] moved helper methods to new class
a32a5ab [Manish Amde] added test for subsampling without replacement
781542a [Manish Amde] added support for fractional subsampling with replacement
3a18cc1 [Manish Amde] cleaned up api for conversion to bagged point and moved tests to it's own test suite
0e81906 [Manish Amde] improving caching unpersisting logic
d971f73 [Manish Amde] moved RF code to use WeightedEnsembleModel class
fee06d3 [Manish Amde] added weighted ensemble model
1b01943 [Manish Amde] add weights for base learners
9bc6e74 [Manish Amde] adding random seed as parameter
d2c8323 [Manish Amde] Merge branch 'master' into gbt
2ae97b7 [Manish Amde] added documentation for the loss classes
9366b8f [Manish Amde] minor: using numTrees instead of trees.size
3b43896 [Manish Amde] added learning rate for prediction
9b2e35e [Manish Amde] Merge branch 'master' into gbt
6a11c02 [manishamde] fixing formatting
823691b [Manish Amde] fixing RF test
1f47941 [Manish Amde] changing access modifier
5b67102 [Manish Amde] shortened parameter list
5ab3796 [Manish Amde] minor reformatting
9155a9d [Manish Amde] consolidated boosting configuration and added public API
631baea [Manish Amde] Merge branch 'master' into gbt
2cb1258 [Manish Amde] public API support
3b8ffc0 [Manish Amde] added documentation
8e10c63 [Manish Amde] modified unpersist strategy
f62bc48 [Manish Amde] added unpersist
bdca43a [Manish Amde] added timing parameters
2fbc9c7 [Manish Amde] fixing binomial classification prediction
6dd4dd8 [Manish Amde] added support for log loss
9af0231 [Manish Amde] classification attempt
62cc000 [Manish Amde] basic checkpointing
4784091 [Manish Amde] formatting
78ed452 [Manish Amde] added newline and fixed if statement
3973dd1 [Manish Amde] minor indicating subsample is double during comparison
aa8fae7 [Manish Amde] minor refactoring
1a8031c [Manish Amde] sampling with replacement
f1c9ef7 [Manish Amde] Merge branch 'master' into gbt
cdceeef [Manish Amde] added documentation
6251fd5 [Manish Amde] modified method name
5538521 [Manish Amde] disable checkpointing for now
0ae1c0a [Manish Amde] basic gradient boosting code from earlier branches

86021955

[SPARK-3838][examples][mllib][python] Word2Vec example in python · e07fb6a4

Anant authored 10 years ago

This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838

Python example for word2vec
mengxr

Author: Anant <anant.asty@gmail.com>

Closes #2952 from anantasty/SPARK-3838 and squashes the following commits:

87bd723 [Anant] remove stop line
4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs.
3d3c9ee [Anant] Added empty line after python imports
0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
ee4f5f6 [Anant] Fixes from code review comments
c637bcf [Anant] Added word2vec python example to docs
269f31f [Anant] added example in docs
c015b14 [Anant] Added python example for word2vec

e07fb6a4

[MLLIB] SPARK-2329 Add multi-label evaluation metrics · 62d01d25

Alexander Ulanov authored 10 years ago

Implements various multi-label classification measures, including Hamming loss, strict and default accuracy, macro-averaged precision, recall and F1-measure based on documents and labels, and micro-averaged measures. See https://issues.apache.org/jira/browse/SPARK-2329
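
For readers unfamiliar with these measures, the Python sketch below computes a few of them using their standard textbook definitions. The function names are hypothetical, and it is an assumption that the PR uses exactly these definitions; each document is represented as a pair of sets `(predicted_labels, true_labels)`.

```python
def hamming_loss(pairs, num_labels):
    """Fraction of (document, label) slots where prediction and truth disagree."""
    mismatches = sum(len(predicted ^ actual) for predicted, actual in pairs)
    return mismatches / (len(pairs) * num_labels)


def subset_accuracy(pairs):
    """Strict accuracy: the predicted label set must match exactly."""
    return sum(1 for predicted, actual in pairs if predicted == actual) / len(pairs)


def accuracy(pairs):
    """Default accuracy: mean Jaccard overlap per document.

    Assumes every document has at least one predicted or true label.
    """
    return sum(len(p & a) / len(p | a) for p, a in pairs) / len(pairs)


def micro_precision(pairs):
    """True positives over all predicted labels, pooled across documents."""
    true_positives = sum(len(p & a) for p, a in pairs)
    return true_positives / sum(len(p) for p, _ in pairs)


def micro_recall(pairs):
    """True positives over all true labels, pooled across documents."""
    true_positives = sum(len(p & a) for p, a in pairs)
    return true_positives / sum(len(a) for _, a in pairs)
```

Micro-averaged measures pool the counts over all documents before dividing, whereas macro-averaged measures compute the ratio per document (or per label) and then average the ratios.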

Multi-class measures are currently in the following pull request: https://github.com/apache/spark/pull/1155

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <nashb@yandex.ru>

Closes #1270 from avulanov/multilabelmetrics and squashes the following commits:

fc8175e [Alexander Ulanov] Merge with previous updates
43a613e [Alexander Ulanov] Addressing reviewers comments: change Set to Array
517a594 [avulanov] Addressing reviewers comments: Scala style
cf4222bc [avulanov] Addressing reviewers comments: renaming. Added label method that returns the list of labels
1843f73 [Alexander Ulanov] Scala style fix
79e8476 [Alexander Ulanov] Replacing fold(_ + _) with sum as suggested by srowen
ca46765 [Alexander Ulanov] Cosmetic changes: Apache header and parameter explanation
40593f5 [Alexander Ulanov] Multi-label metrics: Hamming-loss, strict and normal accuracy, fix to macro measures, bunch of tests
ad62df0 [Alexander Ulanov] Comments and scala style check
154164b [Alexander Ulanov] Multilabel evaluation metics and tests: macro precision and recall averaged by docs, micro and per-class precision and recall averaged by class

62d01d25

SPARK-4175. Exception on stage page · 23f73f52

Sandy Ryza authored 10 years ago

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3043 from sryza/sandy-spark-4175 and squashes the following commits:

e327340 [Sandy Ryza] SPARK-4175. Exception on stage page

23f73f52

[HOT FIX] Yarn stable tests don't compile · 087e31a7

andrewor14 authored 10 years ago

This is caused by this commit: acd4ac7c

Author: andrewor14 <andrew@databricks.com>
Author: Andrew Or <andrew@databricks.com>

Closes #3041 from andrewor14/yarn-hot-fix and squashes the following commits:

e5deba1 [andrewor14] Add new line at the end (minor)
aa998e8 [Andrew Or] Compilation hot fix

087e31a7

[SPARK-3870] EOL character enforcement · 55ab7770

Kousuke Saruta authored 10 years ago

We have shell scripts and Windows batch files, so we should enforce the proper EOL character for each.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2726 from sarutak/eol-enforcement and squashes the following commits:

9748c3f [Kousuke Saruta] Fixed make.bat
252de89 [Kousuke Saruta] Removed extra characters from make.bat
5b81c00 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
8633ed2 [Kousuke Saruta] merge branch 'master' of git://git.apache.org/spark into eol-enforcement
5d630d8 [Kousuke Saruta] Merged
ba10797 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
7407515 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
772fd4e [Kousuke Saruta] Normized EOL character in make.bat and compute-classpath.cmd
ac7f873 [Kousuke Saruta] Added an entry for .gitattributes to .rat-excludes
1570e77 [Kousuke Saruta] Added .gitattributes

55ab7770

[SPARK-4150][PySpark] return self in rdd.setName · f1e7361f

Xiangrui Meng authored 10 years ago

Then we can do `rdd.setName('abc').cache().count()`.
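
The pattern behind this change is a fluent setter: a method that mutates the object and returns it so that calls can be chained. A small sketch with a hypothetical stand-in class (this is not the PySpark `RDD` class):

```python
class NamedDataset:
    """Hypothetical RDD-like object used only to illustrate method chaining."""

    def __init__(self, items):
        self.items = list(items)
        self.name = None
        self.cached = False

    def set_name(self, name):
        self.name = name
        return self  # returning self is what makes chaining possible

    def cache(self):
        self.cached = True
        return self

    def count(self):
        return len(self.items)
```

If `set_name` returned `None` instead, `NamedDataset([1, 2, 3]).set_name("abc").cache()` would fail with an `AttributeError`.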

Author: Xiangrui Meng <meng@databricks.com>

Closes #3011 from mengxr/rdd-setname and squashes the following commits:

10d0d60 [Xiangrui Meng] update test
4ac3bbd [Xiangrui Meng] return self in rdd.setName

f1e7361f

[SPARK-4141] Hide Accumulators column on stage page when no accumulators exist · a68ecf32

Mark Mims authored 10 years ago

WebUI

Author: Mark Mims <mark.mims@canonical.com>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>

Closes #3031 from mmm/remove-accumulators-col and squashes the following commits:

6141cb3 [Mark Mims] reformat to satisfy scalastyle linelength. build failed from jenkins https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22604/
390893b [Mark Mims] cleanup
c28c449 [Mark Mims] looking much better now... minimal explicit formatting. Now, see if any sort keys make sense
fb72156 [Mark Mims] mimic hasInput. The basics work here, but wanna clean this up with maybeAccumulators for column content

a68ecf32

[SPARK-2220][SQL] Fixes remaining Hive commands · 23468e7e

Cheng Lian authored 10 years ago

This PR adds support for the `ADD FILE` Hive command, and removes `ShellCommand` and `SourceCommand`. The reason is described in [this SPARK-2220 comment](https://issues.apache.org/jira/browse/SPARK-2220?focusedCommentId=14191841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191841).

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #3038 from liancheng/hive-commands and squashes the following commits:

6db61e0 [Cheng Lian] Fixes remaining Hive commands

23468e7e

[SPARK-4154][SQL] Query does not work if it has "not between " in Spark SQL and HQL · ea465af1

ravipesala authored 10 years ago

A query that contains "not between" does not work, for example:
SELECT * FROM src where key not between 10 and 20

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3017 from ravipesala/SPARK-4154 and squashes the following commits:

65fc89e [ravipesala] Handled admin comments
32e6d42 [ravipesala] 'not between' is not working

ea465af1

[SPARK-4077][SQL] Spark SQL return wrong values for valid string timestamp values · fa712b30

Venkata Ramana Gollamudi authored 10 years ago

In org.apache.hadoop.hive.serde2.io.TimestampWritable.set, if the next entry is null then the current timestamp object is reset.
Because of this, hiveinspectors:unwrap cannot reuse the same timestamp object without creating a copy.
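
The underlying problem is aliasing of a reused mutable object. The Python sketch below models it with a hypothetical `ReusableTimestamp` holder (not the Hive class): a reader overwrites the same holder for every row, so any earlier row that kept a reference instead of a copy sees the later value.

```python
import copy


class ReusableTimestamp:
    """Hypothetical stand-in for a writable holder that a reader reuses."""

    def __init__(self):
        self.millis = 0

    def set(self, millis):
        # A null entry resets the holder, mirroring the behavior described above.
        self.millis = 0 if millis is None else millis


def read_rows(values, clone):
    """Read all values through one shared holder, optionally copying it per row."""
    holder = ReusableTimestamp()
    rows = []
    for value in values:
        holder.set(value)  # the same object is overwritten for every row
        rows.append(copy.copy(holder) if clone else holder)
    return [row.millis for row in rows]
```

Without cloning, `read_rows([100, None], clone=False)` reports the first row as reset; with `clone=True` the first row keeps its value.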

Author: Venkata Ramana G <ramana.gollamudihuawei.com>

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes #3019 from gvramana/spark_4077 and squashes the following commits:

32d818f [Venkata Ramana Gollamudi] fixed check style
fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object

fa712b30

[SPARK-3826][SQL]enable hive-thriftserver to support hive-0.13.1 · 7c41d135

wangfei authored 10 years ago

hive-thriftserver was not enabled in #2241. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer (see #2241).

1. A light shim layer (code in sql/hive-thriftserver/hive-version) for each Hive version to handle API compatibility.

2. New pom profiles "hive-default" and "hive-versions" (copied from #2241) to activate the different Hive versions.

3. SBT commands for the different versions:
hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly

4. Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver.

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:

f26f3be [wangfei] remove clean to save time
f5cac74 [wangfei] remove local hivecontext test
578234d [wangfei] use new shaded hive
18fb1ff [wangfei] exclude kryo in hive pom
fa21d09 [wangfei] clean package assembly/assembly
8a4daf2 [wangfei] minor fix
0d7f6cf [wangfei] address comments
f7c93ae [wangfei] adding build with hive 0.13 before running tests
bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
c359822 [wangfei] reuse getCommandProcessor in hiveshim
52674a4 [scwf] sql/hive included since examples depend on it
3529e98 [scwf] move hive module to hive profile
f51ff4e [wangfei] update and fix conflicts
f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
41f727b [scwf] revert pom changes
13afde0 [scwf] fix small bug
4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
0bc53aa [scwf] fixed when result filed is null
dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
7c66b8e [scwf] update pom according spark-2706
ae47489 [scwf] update and fix conflicts

7c41d135

[SPARK-4016] Allow user to show/hide UI metrics. · adb6415c

Kay Ousterhout authored 10 years ago

This commit adds a set of checkboxes to the stage detail
page that the user can use to show additional task metrics,
including the GC time, result serialization time, result fetch
time, and scheduler delay.  All of these metrics are now
hidden by default.  This allows advanced users to look at more
detailed metrics, without distracting the average user.

This change also cleans up the stage detail page so that metrics
are shown in the same order in the summary table as in the task table,
and updates the metrics in both tables such that they contain the same
set of metrics.

The ability to remember a user's preferences for which metrics
should be shown has been filed as SPARK-4024.

Here's what the stage detail page looks like by default:
![image](https://cloud.githubusercontent.com/assets/1108612/4744322/3ebe319e-5a2f-11e4-891f-c792be79caa2.png)

and once a user clicks "Show additional metrics" (note that all the metrics get checked by default):
![image](https://cloud.githubusercontent.com/assets/1108612/4744332/51e5abda-5a2f-11e4-8994-d0d3705ee05d.png)

cc shivaram andrewor14

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #2867 from kayousterhout/SPARK-checkboxes and squashes the following commits:

6015913 [Kay Ousterhout] Added comment
08dee73 [Kay Ousterhout] Josh's usability comments
0940d61 [Kay Ousterhout] Style updates based on Andrew's review
ef05ccd [Kay Ousterhout] Added tooltips
d7cfaaf [Kay Ousterhout] Made list of add'l metrics collapsible.
70c1fb5 [Kay Ousterhout] [SPARK-4016] Allow user to show/hide UI metrics.

adb6415c

SPARK-3837. Warn when YARN kills containers for exceeding memory limits · acd4ac7c

Sandy Ryza authored 10 years ago

I triggered the issue and verified the message gets printed on a pseudo-distributed cluster.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2744 from sryza/sandy-spark-3837 and squashes the following commits:

858a268 [Sandy Ryza] Review feedback
c937f00 [Sandy Ryza] SPARK-3837. Warn when YARN kills containers for exceeding memory limits

acd4ac7c

[SPARK-4143] [SQL] Move inner class DeferredObjectAdapter to top level · 58a6077e

Cheng Hao authored 10 years ago

The class DeferredObjectAdapter is an inner class of HiveGenericUdf, which may cause some overhead in closure serialization and deserialization. Move it to the top level.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3007 from chenghao-intel/move_deferred and squashes the following commits:

3a139b1 [Cheng Hao] Move inner class DeferredObjectAdapter to top level

58a6077e

[SPARK-4108][SQL] Fixed usage of deprecated in sql/catalyst/types/datatypes · d31517a3

Anant authored 10 years ago

Fixed usage of deprecated in sql/catalyst/types/datatypes to have version parameter

Author: Anant <anant.asty@gmail.com>

Closes #2970 from anantasty/SPARK-4108 and squashes the following commits:

e92cb01 [Anant] Fixed usage of deprecated in sql/catalyst/types/datatypes to have version parameter

d31517a3

[SPARK-3250] Implement Gap Sampling optimization for random sampling · ad3bd0df

Erik Erlandson authored 10 years ago

More efficient sampling, based on Gap Sampling optimization:
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
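
A minimal Python sketch of the idea, assuming the usual formulation of gap sampling for Bernoulli sampling: instead of drawing one random number per element, draw the length of the gap to the next accepted element from a geometric distribution and skip ahead. The function name and signature are hypothetical and this is not the Spark implementation.

```python
import math
import random


def gap_sample(items, fraction, rng):
    """Sample each element of a sequence independently with probability `fraction`.

    One random draw is made per accepted element rather than per input element,
    which is the saving when `fraction` is small.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    log_reject = math.log(1.0 - fraction)  # log of the per-element rejection probability
    sampled = []
    index = -1
    while True:
        u = rng.random()  # uniform in [0, 1), so 1 - u lies in (0, 1]
        # Number of rejected elements before the next accepted one (geometric).
        gap = int(math.log(1.0 - u) / log_reject)
        index += gap + 1
        if index >= len(items):
            return sampled
        sampled.append(items[index])
```

For example, `gap_sample(list(range(100000)), 0.1, random.Random(42))` returns roughly a tenth of the input, in the original order.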

Author: Erik Erlandson <eerlands@redhat.com>

Closes #2455 from erikerlandson/spark-3250-pr and squashes the following commits:

72496bc [Erik Erlandson] [SPARK-3250] Implement Gap Sampling optimization for random sampling

ad3bd0df

[SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669

Davies Liu authored 10 years ago

Create several helper functions that call the MLlib Java API, convert the arguments to Java types, and convert the return value to a Python object automatically. This greatly simplifies serialization in the MLlib Python API.

After this, the MLlib Python API no longer needs to deal with serialization details, which makes it easier to add new APIs.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2995 from davies/cleanup and squashes the following commits:

8fa6ec6 [Davies Liu] address comments
16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
43743e5 [Davies Liu] bugfix
731331f [Davies Liu] simplify serialization in MLlib Python API

872fc669

Oct 30, 2014

HOTFIX: Clean up build in network module. · 0734d093

Patrick Wendell authored 10 years ago

This is currently breaking the package build for some people (including me).

This patch does some general clean-up which also fixes the current issue.
- Uses consistent artifact naming
- Adds sbt support for this module
- Changes tests to use scalatest (fixes the original issue[1])

One thing to note: it turns out that scalatest, when invoked in the
Maven build, doesn't successfully detect JUnit Java tests. This is
a long-standing issue, and I noticed it applies to all of our current
test suites as well. I've created SPARK-4159 to fix this.

[1] The original issue is that we need to allocate extra memory
for the tests, which happens by default in our scalatest configuration.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #3025 from pwendell/hotfix and squashes the following commits:

faa9053 [Patrick Wendell] HOTFIX: Clean up build in network module.

0734d093

Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use... · 26d31d15

Andrew Or authored 10 years ago

Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop"

This reverts commit 68cb69da.

26d31d15

[SPARK-3968][SQL] Use parquet-mr filter2 api · 2e35e242

Yash Datta authored 10 years ago

The parquet-mr project has introduced a new filter API (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire RowGroups based on statistics such as min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself.
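
To show why min/max statistics allow whole row groups to be skipped, here is a small Python sketch. It does not use the parquet-mr API; `RowGroup` and `scan_greater_than` are hypothetical names, and the predicate is fixed to `value > threshold` for brevity.

```python
from typing import List, NamedTuple, Tuple


class RowGroup(NamedTuple):
    """Hypothetical row group carrying per-column min/max statistics."""
    min_value: int
    max_value: int
    rows: List[int]


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of row groups actually read."""
    matched: List[int] = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row here can satisfy value > threshold
        groups_read += 1
        matched.extend(v for v in group.rows if v > threshold)
    return matched, groups_read
```

Groups whose maximum is at or below the threshold are never decoded, so the saving grows with how well the data is clustered on the filtered column.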

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #2841 from saucam/master and squashes the following commits:

8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns
515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column
5f4530e [Yash Datta] SPARK-3968: Fix scala code style
f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering
ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter
48163c3 [Yash Datta] SPARK-3968: Code cleanup
cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working 2. Use the serialization/deserialization from Parquet library for filter pushdown
caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api
49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns
9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr

2e35e242

[SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM... · 9b6ebe33

ravipesala authored 10 years ago

[SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL

Right now this works only for 2 tables, as in the query below:
sql("SELECT * FROM records1 as a,records2 as b where a.key=b.key ")

It does not work for more than 2 tables, as in the query below:
sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key")
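
One common way to support an arbitrary number of comma-separated relations is to fold the list into a left-deep tree of joins, leaving the `where` clause to supply the join conditions. The Python sketch below illustrates that folding with plain tuples; it is an assumption that the fix works along these lines, and `join_all` is a hypothetical name.

```python
from functools import reduce


def join_all(relations):
    """Fold a list of relations into a left-deep join tree."""
    if not relations:
        raise ValueError("at least one relation is required")
    return reduce(lambda left, right: ("Join", left, right), relations)
```

For three relations this yields `("Join", ("Join", "records1", "records2"), "records3")`, so a parser that only handled the two-relation case generalizes without special cases.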

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2987 from ravipesala/multijoin and squashes the following commits:

429b005 [ravipesala] Support multiple joins

9b6ebe33

SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop · 68cb69da

Sean Owen authored 10 years ago

(This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?)

Author: Sean Owen <sowen@cloudera.com>

Closes #2814 from srowen/SPARK-1209 and squashes the following commits:

ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark

68cb69da

[SPARK-3661] Respect spark.*.memory in cluster mode · 2f545438

Andrew Or authored 10 years ago

This also includes minor re-organization of the code. Tested locally in both client and deploy modes.

Author: Andrew Or <andrew@databricks.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #2697 from andrewor14/memory-cluster-mode and squashes the following commits:

01d78bc [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
ccd468b [Andrew Or] Add some comments per Patrick
c956577 [Andrew Or] Tweak wording
2b4afa0 [Andrew Or] Unused import
47a5a88 [Andrew Or] Correct Spark properties precedence order
bf64717 [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
dd452d0 [Andrew Or] Respect spark.*.memory in cluster mode

2f545438

[SPARK-4153][WebUI] Update the sort keys for HistoryPage · d3450578

zsxwing authored 10 years ago

Sort "Started", "Completed", "Duration" and "Last Updated" by time.

Author: zsxwing <zsxwing@gmail.com>

Closes #3014 from zsxwing/SPARK-4153 and squashes the following commits:

ec8b9ad [zsxwing] Sort "Started", "Completed", "Duration" and "Last Updated" by time

d3450578

Minor style hot fix after #2711 · 849b43ec

Andrew Or authored 10 years ago

I had planned to fix this when I merged it but I forgot to. witgo

Author: Andrew Or <andrew@databricks.com>

Closes #3018 from andrewor14/command-utils-style and squashes the following commits:

c2959fb [Andrew Or] Style hot fix

849b43ec

[SPARK-4155] Consolidate usages of <driver> · 9334d699

Andrew Or authored 10 years ago

We use "<driver>" everywhere. Let's not do that.

Author: Andrew Or <andrew@databricks.com>

Closes #3020 from andrewor14/consolidate-driver and squashes the following commits:

c1c2204 [Andrew Or] Just use "<driver>" for local executor ID
3d751e9 [Andrew Or] Consolidate usages of <driver>

9334d699

[Minor] A few typos in comments and log messages · 5231a3f2

Andrew Or authored 10 years ago

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #3021 from andrewor14/typos and squashes the following commits:

daaf417 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
4838ae4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
026d426 [Andrew Or] Merge branch 'master' of github.com:andrewor14/spark into typos
a81ae8f [Andrew Or] Some typos

5231a3f2

[SPARK-4138][SPARK-4139] Improve dynamic allocation settings · 26f092d4

Andrew Or authored 10 years ago

This should be merged after #2746 (SPARK-3795).

**SPARK-4138**. If the user sets both the number of executors and `spark.dynamicAllocation.enabled`, we should throw an exception.

**SPARK-4139**. If the user sets `spark.dynamicAllocation.enabled`, we should use the max number of executors as the starting number of executors because the first job is likely to run immediately after application startup. If the latter is not set, throw an exception.
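
The two rules can be sketched as a small validation function. This is an illustrative Python sketch over a plain dictionary, not the Spark code; only `spark.dynamicAllocation.enabled` is named in the description above, so the other two keys (`spark.executor.instances` and `spark.dynamicAllocation.maxExecutors`) are assumptions about which settings are meant.

```python
def initial_executors(conf):
    """Return the starting number of executors, or None if unspecified.

    Raises ValueError for the incompatible combinations described above.
    """
    dynamic = conf.get("spark.dynamicAllocation.enabled", "false").lower() == "true"
    fixed = conf.get("spark.executor.instances")  # assumed key for a fixed count
    if not dynamic:
        return int(fixed) if fixed is not None else None
    if fixed is not None:
        # SPARK-4138: a fixed executor count contradicts dynamic allocation.
        raise ValueError("cannot set both a fixed number of executors and dynamic allocation")
    maximum = conf.get("spark.dynamicAllocation.maxExecutors")  # assumed key
    if maximum is None:
        # SPARK-4139: the max is needed because it is used as the starting number.
        raise ValueError("dynamic allocation requires the max number of executors")
    return int(maximum)
```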

Author: Andrew Or <andrew@databricks.com>

Closes #3002 from andrewor14/yarn-set-executors and squashes the following commits:

c528fce [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-set-executors
55d4699 [Andrew Or] Bug fix: `isDynamicAllocationEnabled` was always false
2b0ccec [Andrew Or] Start the number of executors at the max
022bfde [Andrew Or] Guard against incompatible settings of number of executors

26f092d4

[SPARK-3319] [SPARK-3338] Resolve Spark submit config paths · 24c51292

Andrew Or authored 10 years ago

The bulk of this PR is comprised of tests. All changes in functionality are made in `SparkSubmit.scala` (~20 lines).

**SPARK-3319.** There is currently a divergence in behavior when the user passes in additional jars through `--jars` and through setting `spark.jars` in the default properties file. The former will happily resolve the paths (e.g. convert `my.jar` to `file:/absolute/path/to/my.jar`), while the latter does not. We should resolve paths consistently in both cases. This also applies to the following pairs of command line arguments and Spark configs:

- `--jars` ~ `spark.jars`
- `--files` ~ `spark.files` / `spark.yarn.dist.files`
- `--archives` ~ `spark.yarn.dist.archives`
- `--py-files` ~ `spark.submit.pyFiles`

**SPARK-3338.** This PR also fixes the following bug: if the user sets `spark.submit.pyFiles` in his/her properties file, it does not actually get picked up even if `--py-files` is not set. This is simply because the config is overridden by an empty string.
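
Both fixes can be illustrated with a short Python sketch. It assumes POSIX-style paths and uses hypothetical helper names (`resolve_uri`, `resolve_uris`, `effective_value`); it is not a port of `Utils.resolveURI`.

```python
import os
from urllib.parse import urlparse


def resolve_uri(path):
    """Leave URIs that already have a scheme alone; make bare paths absolute file: URIs."""
    if urlparse(path).scheme:
        return path
    return "file:" + os.path.abspath(path)


def resolve_uris(paths):
    """Resolve a comma-separated list of paths."""
    return ",".join(resolve_uri(p.strip()) for p in paths.split(",") if p.strip())


def effective_value(cli_value, conf_value):
    """Prefer the command-line value, but never let an empty one hide the config value."""
    chosen = cli_value if cli_value else conf_value
    return resolve_uris(chosen) if chosen else None
```

Because an already-resolved URI has a scheme, resolving twice gives the same result as resolving once, so the same function can be applied to both command-line arguments and config values.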

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #2232 from andrewor14/resolve-config-paths and squashes the following commits:

fff2869 [Andrew Or] Add spark.yarn.jar
da3a1c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
f0fae64 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
05e03d6 [Andrew Or] Add tests for resolving both command line and config paths
460117e [Andrew Or] Resolve config paths properly
fe039d3 [Andrew Or] Beef up tests to test fixed-pointed-ness of Utils.resolveURI(s)

24c51292

[SPARK-4078] New FsPermission instance w/o FsPermission.createImmutable in eventlog · 9142c9b8

Grace authored 10 years ago

By default, Spark builds its package against Hadoop 1.0.4. That version has an FsPermission bug (see [HADOOP-7629](https://issues.apache.org/jira/browse/HADOOP-7629) by Todd Lipcon), which was fixed as of version 1.1. When the FsPermission.createImmutable() API is used, end users may see an RPC exception like the one below (if event logging over HDFS is turned on). This patch proposes a quick fix that avoids the exception for all Hadoop versions.
```
Exception in thread "main" java.io.IOException: Call to sr484/10.1.2.84:54310 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
        at org.apache.hadoop.ipc.Client.call(Client.java:1118)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
        at $Proxy6.setPermission(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
        at $Proxy6.setPermission(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:1285)
        at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:572)
        at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:138)
        at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
        at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:324)
```

Author: Grace <jie.huang@intel.com>

Closes #2892 from GraceH/eventlog-rpc and squashes the following commits:

58ea038 [Grace] new FsPermission Instance w/o FsPermission.createImmutable

9142c9b8

[SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either... · fb1fbca2

Tathagata Das authored 10 years ago

[SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either from BlockManager or WAL in HDFS

As part of the initiative to prevent data loss on streaming driver failure, this sub-task implements a BlockRDD that is backed by HDFS. This BlockRDD can read data either from Spark's BlockManager or from file segments in the write ahead log in HDFS.
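The read path described here can be sketched in a few lines. This is a minimal standalone illustration, not the code from this PR: `FileSegment`, `InMemoryBlockStore`, `append_record` and `read_block` are hypothetical names, a local file stands in for the write ahead log in HDFS, and a dictionary stands in for the BlockManager.

```python
import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FileSegment:
    """Hypothetical pointer to one record inside a write ahead log file."""
    path: str
    offset: int
    length: int


class InMemoryBlockStore:
    """Stand-in for a block manager: blocks live only as long as the process."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = data

    def get(self, block_id: str) -> Optional[bytes]:
        return self._blocks.get(block_id)


def append_record(path: str, data: bytes) -> FileSegment:
    """Append data to the log file and return the segment it occupies."""
    with open(path, "ab") as log:
        offset = log.seek(0, os.SEEK_END)  # position of the new record
        log.write(data)
    return FileSegment(path, offset, len(data))


def read_block(store: InMemoryBlockStore, block_id: str, segment: FileSegment) -> bytes:
    """Prefer the in-memory copy; fall back to the log segment if it is gone."""
    data = store.get(block_id)
    if data is not None:
        return data
    with open(segment.path, "rb") as log:
        log.seek(segment.offset)
        data = log.read(segment.length)
    if len(data) != segment.length:
        raise IOError(f"log segment for block {block_id} is truncated")
    return data
```

Keeping the offset and length with each block is what lets a reader recover a single block without scanning the whole log file.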

Most of this code was written by @harishreedharan

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #2931 from tdas/driver-ha-rdd and squashes the following commits:

209e49c [Tathagata Das] Better fix to style issue.
4a5866f [Tathagata Das] Addressed one more comment.
ed5fbf0 [Tathagata Das] Minor updates.
b0a18b1 [Tathagata Das] Fixed import order.
20aa7c6 [Tathagata Das] Fixed more line length issues.
29aa099 [Tathagata Das] Fixed line length issues.
9e47b5b [Tathagata Das] Renamed class, simplified+added unit tests.
6e1bfb8 [Tathagata Das] Tweaks test suite to create Spark context lazily to prevent context leaks.
9c86a61 [Tathagata Das] Merge pull request #22 from harishreedharan/driver-ha-rdd
2878c38 [Hari Shreedharan] Shutdown spark context after tests. Formatting/minor fixes
c709f2f [Tathagata Das] Merge pull request #21 from harishreedharan/driver-ha-rdd
5cce16f [Hari Shreedharan] Make sure getBlockLocations uses offset and length to find the blocks on HDFS
eadde56 [Tathagata Das] Transferred HDFSBackedBlockRDD for the driver-ha-working branch

fb1fbca2

[SPARK-4028][Streaming] ReceivedBlockHandler interface to abstract the... · 234de923

Tathagata Das authored 10 years ago

[SPARK-4028][Streaming] ReceivedBlockHandler interface to abstract the functionality of storage of received data

As part of the initiative to prevent data loss on streaming driver failure, this JIRA tracks the subtask of implementing a ReceivedBlockHandler that abstracts the storage of received data blocks. The default implementation will maintain the current behavior of storing the data in BlockManager. The optional implementation will store the data in both BlockManager and a write ahead log.
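A rough sketch of this kind of abstraction, written as standalone Python rather than the Scala in this PR. Only the name `ReceivedBlockHandler` comes from the description. `StoreResult`, `BlockStoreOnlyHandler` and `LoggingHandler` are hypothetical, and the order of the two writes in the logging variant is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StoreResult:
    """Hypothetical result type: where the block ended up."""
    block_id: str
    log_index: Optional[int]  # None when no write ahead log was used


class ReceivedBlockHandler(ABC):
    """Hides how a received block is stored from the code that receives it."""

    @abstractmethod
    def store_block(self, block_id: str, data: bytes) -> StoreResult:
        ...


class BlockStoreOnlyHandler(ReceivedBlockHandler):
    """Default behavior: keep the block in the in-memory store only."""

    def __init__(self, blocks: Dict[str, bytes]) -> None:
        self.blocks = blocks

    def store_block(self, block_id: str, data: bytes) -> StoreResult:
        self.blocks[block_id] = data
        return StoreResult(block_id, None)


class LoggingHandler(ReceivedBlockHandler):
    """Optional behavior: store the block in memory and in a log as well."""

    def __init__(self, blocks: Dict[str, bytes], log: List[bytes]) -> None:
        self.blocks = blocks
        self.log = log  # a list stands in for a durable write ahead log

    def store_block(self, block_id: str, data: bytes) -> StoreResult:
        # Assumption: the log is written first so a crash cannot leave
        # an in-memory block that has no durable copy.
        self.log.append(data)
        self.blocks[block_id] = data
        return StoreResult(block_id, len(self.log) - 1)
```

The receiving code depends only on `store_block`, so switching between the two behaviors is a matter of which handler is constructed.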

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2940 from tdas/driver-ha-rbh and squashes the following commits:

78a4aaa [Tathagata Das] Fixed bug causing test failures.
f192f47 [Tathagata Das] Fixed import order.
df5f320 [Tathagata Das] Updated code to use ReceivedBlockStoreResult as the return type for handler's storeBlock
33c30c9 [Tathagata Das] Added license, and organized imports.
2f025b3 [Tathagata Das] Updates based on PR comments.
18aec1e [Tathagata Das] Moved ReceivedBlockInfo back into spark.streaming.scheduler package
95a4987 [Tathagata Das] Added ReceivedBlockHandler and its associated tests

234de923

SPARK-4111 [MLlib] add regression metrics · d9327192

Yanbo Liang authored 10 years ago

Add RegressionMetrics.scala, which provides regression metrics for model evaluation, and the corresponding test suite RegressionMetricsSuite.scala.
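For orientation, these are the standard definitions of a few common regression metrics, as a standalone sketch. The function name `regression_metrics` and the choice of metrics are assumptions. The description above does not list which metrics the class computes, although the squashed commits do mention an r2 score.

```python
import math
from typing import Dict, Sequence


def regression_metrics(predictions: Sequence[float],
                       observations: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MSE, RMSE and R^2 for paired predictions and observations."""
    if len(predictions) != len(observations) or len(observations) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observations)
    errors = [p - o for p, o in zip(predictions, observations)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mean_obs = sum(observations) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observations)
    # R^2 is undefined when every observation is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}
```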

Author: Yanbo Liang <yanbohappy@gmail.com>
Author: liangyanbo <liangyanbo@meituan.com>

Closes #2978 from yanbohappy/regression_metrics and squashes the following commits:

730d0a9 [Yanbo Liang] clearer annotation
3d0bec1 [Yanbo Liang] rename and keep code style
a8ad3e3 [Yanbo Liang] simplify code for keeping style
d454909 [Yanbo Liang] rename parameter and function names, delete unused columns, add reference
2e56282 [liangyanbo] rename r2_score() and remove unused column
43bb12b [liangyanbo] add regression metrics

d9327192

[SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace · c7ad0852

Joseph E. Gonzalez authored 10 years ago

This simple patch filters out extra whitespace entries.
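The idea can be shown with a standalone parser for one line in libSVM format (`label index:value ...`, with one-based indices). This is a sketch of the technique, not the MLlib code, and `parse_libsvm_line` is a hypothetical name.

```python
from typing import List, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, List[int], List[float]]:
    """Parse 'label index:value ...' while tolerating repeated spaces."""
    # Splitting on a single space yields empty strings wherever spaces
    # repeat; dropping those entries is the filtering step.
    items = [item for item in line.strip().split(" ") if item]
    label = float(items[0])
    indices: List[int] = []
    values: List[float] = []
    for item in items[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # convert to zero-based
        values.append(float(value))
    return label, indices, values
```

Without the filter, an input such as `"1  1:0.5"` produces an empty entry that fails when it is split on `:`.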

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Author: Joey <joseph.e.gonzalez@gmail.com>

Closes #2996 from jegonzal/loadLibSVM and squashes the following commits:

e0227ab [Joey] improving readability
e028e84 [Joseph E. Gonzalez] fixing whitespace bug in loadLibSVMFile when parsing libSVM files

c7ad0852

[SPARK-4102] Remove unused ShuffleReader.stop() method. · 6db31574

Kay Ousterhout authored 10 years ago

This method is not implemented by the only subclass
(HashShuffleReader), nor is it ever called. While the
use of Scala's fancy "???" was pretty exciting, the method's
existence can only lead to confusion and it therefore should
be deleted.

mateiz was there a reason for adding this that I'm
missing?

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #2966 from kayousterhout/SPARK-4102 and squashes the following commits:

532c564 [Kay Ousterhout] Added back commented-out method, as per Matei's request
904655e [Kay Ousterhout] [SPARK-4102] Remove unused ShuffleReader.stop() method.

6db31574

[SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH instead of -Djava.library.path · cd739bd7

GuoQiang Li authored 10 years ago

- [X] Standalone
- [X] YARN
- [X] Mesos
- [X] Mac OS X
- [X] Linux
- [ ] Windows

This is an alternative implementation of #1031
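The general approach of passing native library directories through the environment, rather than as a JVM flag, might look like the sketch below. It is standalone Python, not the CommandUtils code from this PR. Both function names are hypothetical, and the Windows branch is purely illustrative since Windows is unchecked in the list above.

```python
import os
from typing import Dict, List, Mapping


def library_path_env_name(platform: str) -> str:
    """Pick the variable the dynamic loader reads (platform as in sys.platform)."""
    if platform.startswith("win"):
        return "PATH"  # illustrative only, see the note above
    if platform == "darwin":
        return "DYLD_LIBRARY_PATH"
    return "LD_LIBRARY_PATH"


def with_library_path(env: Mapping[str, str], library_dirs: List[str],
                      platform: str, sep: str = os.pathsep) -> Dict[str, str]:
    """Return a copy of env with library_dirs placed ahead of any existing value."""
    name = library_path_env_name(platform)
    child_env = dict(env)
    existing = child_env.get(name)
    entries = list(library_dirs) + ([existing] if existing else [])
    child_env[name] = sep.join(entries)
    return child_env
```

The returned mapping can be handed to a child process launcher such as `subprocess.Popen(cmd, env=child_env)`.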

Author: GuoQiang Li <witgo@qq.com>

Closes #2711 from witgo/SPARK-1719 and squashes the following commits:

c7b26f6 [GuoQiang Li] review commits
4488e41 [GuoQiang Li] Refactoring CommandUtils
a444094 [GuoQiang Li] review commits
40c0b4a [GuoQiang Li] Add buildLocalCommand method
c1a0ddd [GuoQiang Li] fix comments
156ce88 [GuoQiang Li] review commit
38aa377 [GuoQiang Li] Refactor CommandUtils.scala
4269e00 [GuoQiang Li] Refactor SparkSubmitDriverBootstrapper.scala
7a1d634 [GuoQiang Li] use LD_LIBRARY_PATH instead of -Djava.library.path

cd739bd7

Oct 29, 2014

[SPARK-4053][Streaming] Made the ReceiverSuite test more reliable, by fixing... · 12342580

Tathagata Das authored 10 years ago

[SPARK-4053][Streaming] Made the ReceiverSuite test more reliable, by fixing block generator throttling

In the unit test that checks whether blocks generated by the throttled block generator have the expected number of records, the thresholds were too tight, which sometimes caused the test to fail.
This PR fixes that by relaxing the thresholds and the time intervals used for testing.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2900 from tdas/receiver-suite-flakiness and squashes the following commits:

28508a2 [Tathagata Das] Made the ReceiverSuite test more reliable

12342580