  1. Apr 04, 2014
  2. Apr 03, 2014
    • Patrick Wendell's avatar
      Revert "[SPARK-1398] Removed findbugs jsr305 dependency" · 33e63618
      Patrick Wendell authored
      This reverts commit 92a86b28.
      33e63618
    • Michael Armbrust's avatar
      Fix jenkins from giving the green light to builds that don't compile. · 9231b011
      Michael Armbrust authored
       Adding `| grep` swallows the non-zero return code from sbt failures. See [here](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13735/consoleFull) for a Jenkins run that fails to compile, but still gets a green light.
      
      Note the [BUILD FIX] commit isn't actually part of this PR, but github is out of date.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #317 from marmbrus/fixJenkins and squashes the following commits:
      
      7c77ff9 [Michael Armbrust] Remove output filter that was swallowing non-zero exit codes for test failures.
      9231b011
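The failure mode described above is easy to reproduce in any POSIX shell (a generic illustration, not the Jenkins script itself): a pipeline's exit status is that of its *last* command, so piping through `grep` or `cat` hides an upstream failure.

```shell
# The pipeline exits with cat's status (0), not false's (1),
# so the upstream failure is silently swallowed.
false | cat
echo "pipeline exit status: $?"   # prints 0
```

In bash, `set -o pipefail` makes the whole pipeline fail when any stage fails; this PR instead removed the output filter entirely.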
    • Michael Armbrust's avatar
      [BUILD FIX] Fix compilation of Spark SQL Java API. · d94826be
      Michael Armbrust authored
      The JavaAPI and the Parquet improvements PRs didn't conflict, but broke the build.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #316 from marmbrus/hotFixJavaApi and squashes the following commits:
      
      0b84c2d [Michael Armbrust] Fix compilation of Spark SQL Java API.
      d94826be
    • Diana Carroll's avatar
      [SPARK-1134] Fix and document passing of arguments to IPython · a599e43d
      Diana Carroll authored
      This is based on @dianacarroll's previous pull request https://github.com/apache/spark/pull/227, and @joshrosen's comments on https://github.com/apache/spark/pull/38. Since we do want to allow passing arguments to IPython, this does the following:
      * It documents that IPython can't be used with standalone jobs for now. (Later versions of IPython will deal with PYTHONSTARTUP properly and enable this, see https://github.com/ipython/ipython/pull/5226, but no released version has that fix.)
      * If you run `pyspark` with `IPYTHON=1`, it passes your command-line arguments to it. This way you can do stuff like `IPYTHON=1 bin/pyspark notebook`.
      * The old `IPYTHON_OPTS` remains, but I've removed it from the documentation. This is in case people read an old tutorial that uses it.
      
      This is not a perfect solution and I'd also be okay with keeping things as they are today (ignoring `$@` for IPython and using IPYTHON_OPTS), and only doing the doc change. With this change though, when IPython fixes https://github.com/ipython/ipython/pull/5226, people will immediately be able to do `IPYTHON=1 bin/pyspark myscript.py` to run a standalone script and get all the benefits of running scripts in IPython (presumably better debugging and such). Without it, there will be no way to run scripts in IPython.
      
      @joshrosen you should probably take the final call on this.
      
      Author: Diana Carroll <dcarroll@cloudera.com>
      
      Closes #294 from mateiz/spark-1134 and squashes the following commits:
      
      747bb13 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
      a599e43d
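The dispatch rule described above can be sketched as a tiny shell helper. This is a hypothetical illustration of the behavior (`choose_interpreter` is an invented name; the real logic lives in `bin/pyspark`):

```shell
# With IPYTHON=1, forward the command-line arguments to ipython;
# otherwise run the regular Python interpreter.
choose_interpreter() {
  if [ "${IPYTHON:-0}" = "1" ]; then
    echo "ipython $*"
  else
    echo "python $*"
  fi
}

IPYTHON=1
choose_interpreter notebook   # ipython notebook
```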
    • Michael Armbrust's avatar
      [SQL] SPARK-1333 First draft of java API · b8f53419
      Michael Armbrust authored
      WIP: Some work remains...
       * [x] Hive support
       * [x] Tests
       * [x] Update docs
      
      Feedback welcome!
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #248 from marmbrus/javaSchemaRDD and squashes the following commits:
      
      b393913 [Michael Armbrust] @srowen 's java style suggestions.
      f531eb1 [Michael Armbrust] Address matei's comments.
      33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
      822f626 [Michael Armbrust] improve docs.
      ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
      0b859c8 [Michael Armbrust] First draft of java API.
      b8f53419
    • Prashant Sharma's avatar
      Spark 1162 Implemented takeOrdered in pyspark. · c1ea3afb
      Prashant Sharma authored
Python does not have a max-heap library, and the usual tricks (such as negating values) do not work for all cases, so we have our own implementation of a max heap.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:
      
      35f86ba [Prashant Sharma] code review
      2b1124d [Prashant Sharma] fixed tests
      e8a08e2 [Prashant Sharma] Code review comments.
      49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
      c1ea3afb
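The semantics of `takeOrdered` can be sketched with the standard library's bounded-heap helper: `heapq.nsmallest` keeps only `num` elements in memory, which is the same effect the PR's hand-rolled max heap achieves for arbitrary comparable keys. The helper below is an illustration, not PySpark's implementation:

```python
import heapq

def take_ordered(iterable, num, key=None):
    # nsmallest maintains a bounded heap of size `num` internally,
    # so it works for non-numeric keys where value negation fails.
    return heapq.nsmallest(num, iterable, key=key)

take_ordered([10, 1, 2, 9, 3], 3)             # [1, 2, 3]
take_ordered(["bb", "a", "ccc"], 2, key=len)  # ['a', 'bb']
```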
    • Cheng Hao's avatar
      [SPARK-1360] Add Timestamp Support for SQL · 5d1feda2
      Cheng Hao authored
      This PR includes:
      1) Add new data type Timestamp
2) Add more data type casting based on Hive's rules
3) Fix a bug: missing data types in both parsers (HiveQl & SQLParser).
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #275 from chenghao-intel/timestamp and squashes the following commits:
      
      df709e5 [Cheng Hao] Move orc_ends_with_nulls to blacklist
      24b04b0 [Cheng Hao] Put 3 cases into the black lists(describe_pretty,describe_syntax,lateral_view_outer)
      fc512c2 [Cheng Hao] remove the unnecessary data type equality check in data casting
      d0d1919 [Cheng Hao] Add more data type for scala reflection
      3259808 [Cheng Hao] Add the new Golden files
      3823b97 [Cheng Hao] Update the UnitTest cases & add timestamp type for HiveQL
      54a0489 [Cheng Hao] fix bug mapping to 0 (which is supposed to be null) when NumberFormatException occurs
      9cb505c [Cheng Hao] Fix issues according to PR comments
      e529168 [Cheng Hao] Fix bug of converting from String
      6fc8100 [Cheng Hao] Update Unit Test & CodeStyle
      8a1d4d6 [Cheng Hao] Add DataType for SqlParser
      ce4385e [Cheng Hao] Add TimestampType Support
      5d1feda2
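One subtlety fixed along the way (commit 54a0489) is that a failed string-to-timestamp cast must yield null rather than 0. A sketch of that rule, using a hypothetical helper and format string rather than Spark's actual cast code:

```python
from datetime import datetime

def cast_to_timestamp(s):
    # Invalid input maps to None (SQL null), never to a zero timestamp.
    try:
        return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    except (ValueError, TypeError):
        return None

cast_to_timestamp("2014-04-03 12:00:00")  # datetime(2014, 4, 3, 12, 0)
cast_to_timestamp("not a timestamp")      # None
```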
    • Andre Schumacher's avatar
      Spark parquet improvements · fbebaedf
      Andre Schumacher authored
      A few improvements to the Parquet support for SQL queries:
- Instead of individual files, a ParquetRelation is now backed by a directory, which simplifies importing data from other sources
- The InsertIntoParquetTable operation now supports switching between overwriting and appending (at least in HiveQL)
- Tests now use the new API
- Parquet logging can be set to WARNING level (the default)
- Default compression for Parquet files (GZIP, as in parquet-mr)
      
      Author: Andre Schumacher <andre.schumacher@iki.fi>
      
      Closes #195 from AndreSchumacher/spark_parquet_improvements and squashes the following commits:
      
      54df314 [Andre Schumacher] SPARK-1383 [SQL] Improvements to ParquetRelation
      fbebaedf
    • Mark Hamstra's avatar
      [SPARK-1398] Removed findbugs jsr305 dependency · 92a86b28
      Mark Hamstra authored
      Should be a painless upgrade, and does offer some significant advantages should we want to leverage FindBugs more during the 1.0 lifecycle. http://findbugs.sourceforge.net/findbugs2.html
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #307 from markhamstra/findbugs and squashes the following commits:
      
      99f2d09 [Mark Hamstra] Removed unnecessary findbugs jsr305 dependency
      92a86b28
  3. Apr 02, 2014
    • Michael Armbrust's avatar
      [SQL] SPARK-1364 Improve datatype and test coverage for ScalaReflection schema inference. · 47ebea54
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #293 from marmbrus/reflectTypes and squashes the following commits:
      
      f54e8e8 [Michael Armbrust] Improve datatype and test coverage for ScalaReflection schema inference.
      47ebea54
    • Xiangrui Meng's avatar
      [SPARK-1212, Part II] Support sparse data in MLlib · 9c65fa76
      Xiangrui Meng authored
      In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:
      
      1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
      2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
      3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
      4. Add libSVMFile to MLContext.
      5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
      6. Gradient computation no longer creates temp vectors.
      7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
      
      TODO:
      1. ~~Use axpy when possible.~~
      2. ~~Optimize Naive Bayes.~~
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #245 from mengxr/vector and squashes the following commits:
      
      eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
      c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
      11999c7 [Xiangrui Meng] Merge branch 'master' into vector
      f7da54b [Xiangrui Meng] add minSplits to libSVMFile
      da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
      493f26f [Xiangrui Meng] Merge branch 'master' into vector
      7c1bc01 [Xiangrui Meng] add a TODO to NB
      b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
      b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
      4addc50 [Xiangrui Meng] merge master
      4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
      f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
      d088552 [Xiangrui Meng] use static constructor for MLContext
      6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
      3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
      0f8759b [Xiangrui Meng] minor updates to NB
      b11659c [Xiangrui Meng] style update
      78c4671 [Xiangrui Meng] add libSVMFile to MLContext
      f0fe616 [Xiangrui Meng] add a test for sparse linear regression
      44733e1 [Xiangrui Meng] use in-place gradient computation
      e981396 [Xiangrui Meng] use axpy in Updater
      db808a1 [Xiangrui Meng] update JavaLR example
      befa592 [Xiangrui Meng] passed scala/java tests
      75c83a4 [Xiangrui Meng] passed test compile
      1859701 [Xiangrui Meng] passed compile
      834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
      135ab72 [Xiangrui Meng] merge glm
      0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
      d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
      3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
      9c65fa76
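The `libSVMFile` / `loadLibSVMData` reader mentioned above consumes the standard LIBSVM text format, `<label> <index>:<value> ...`, with 1-based sparse indices. A minimal parser sketch (a hypothetical helper, not MLlib's code):

```python
def parse_libsvm_line(line):
    # "<label> <index>:<value> ..." -> (label, indices, values)
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        i, v = item.split(":")
        indices.append(int(i) - 1)  # LIBSVM indices are 1-based
        values.append(float(v))
    return label, indices, values

parse_libsvm_line("1.0 1:0.5 3:2.0")  # (1.0, [0, 2], [0.5, 2.0])
```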
    • Reynold Xin's avatar
      StopAfter / TopK related changes · ed730c95
      Reynold Xin authored
      1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases.
      2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API.
      3. Avoid breaking lineage in Limit.
      4. Added a bunch of override's to execution/basicOperators.scala.
      
      @marmbrus @liancheng
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #233 from rxin/limit and squashes the following commits:
      
      13eb12a [Reynold Xin] Merge pull request #1 from marmbrus/limit
      92b9727 [Michael Armbrust] More hacks to make Maps serialize with Kryo.
      4fc8b4e [Reynold Xin] Merge branch 'master' of github.com:apache/spark into limit
      87b7d37 [Reynold Xin] Use the proper serializer in limit.
      9b79246 [Reynold Xin] Updated doc for Limit.
      47d3327 [Reynold Xin] Copy tuples in Limit before shuffle.
      231af3a [Reynold Xin] Limit/TakeOrdered: 1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases. 2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API. 3. Avoid breaking lineage in Limit. 4. Added a bunch of override's to execution/basicOperators.scala.
      ed730c95
    • Cheng Lian's avatar
      [SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage · 1faa5797
      Cheng Lian authored
      JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)
      
      (Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)
      
      This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:
      
      *   `CompressionScheme`
      
          Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:
      
          * `RunLengthEncoding`
          * `DictionaryEncoding`
      
          Algorithms to be implemented include:
      
          * `BooleanBitSet`
          * `IntDelta`
          * `LongDelta`
      
      *   `CompressibleColumnBuilder`
      
    A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns.  The `CompressionScheme` with the lowest compression ratio is chosen for each column, based on statistical information gathered while elements are appended into the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, the column is left uncompressed to save CPU time.
      
    Memory layout of the final byte buffer is shown below:
      
          ```
           .--------------------------- Column type ID (4 bytes)
           |   .----------------------- Null count N (4 bytes)
           |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
           |   |   |     .------------- Compression scheme ID (4 bytes)
           |   |   |     |   .--------- Compressed non-null elements
           V   V   V     V   V
          +---+---+-----+---+---------+
          |   |   | ... |   | ... ... |
          +---+---+-----+---+---------+
           \-----------/ \-----------/
              header         body
          ```
      
      *   `CompressibleColumnAccessor`
      
    A stackable `ColumnAccessor` trait used to iterate over (possibly) compressed data columns.
      
      *   `ColumnStats`
      
          Used to collect statistical information while loading data into in-memory columnar table. Optimizations like partition pruning rely on this information.
      
          Strictly speaking, `ColumnStats` related code is not part of the compression support. It's contained in this PR to ensure and validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).
      
      A major refactoring change since PR #205 is:
      
      * Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #285 from liancheng/memColumnarCompression and squashes the following commits:
      
      ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
      d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
      5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
      c298b76 [Cheng Lian] Test suites refactored
      2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
      211331c [Cheng Lian] WIP: in-memory columnar compression support
      85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
      1faa5797
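Of the schemes listed, run-length encoding is the simplest to illustrate: consecutive equal values collapse into (value, count) pairs. This is a generic sketch of the idea, not Spark SQL's typed byte-buffer encoder:

```python
def rle_encode(values):
    # Collapse consecutive runs of equal values into (value, count) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    # Expand each (value, count) pair back into `count` copies of value.
    return [v for v, n in runs for _ in range(n)]

rle_encode([7, 7, 7, 3, 3, 9])  # [(7, 3), (3, 2), (9, 1)]
```

Run-length encoding wins when columns contain long runs of repeated values (e.g. sorted or low-cardinality data); for other distributions the 80% threshold above would leave the column uncompressed.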
    • Daniel Darabos's avatar
      Do not re-use objects in the EdgePartition/EdgeTriplet iterators. · 78236334
      Daniel Darabos authored
This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact in my measurements. It also simplifies the code. As far as I can tell, the object re-use was nothing but premature optimization.
      
      I did actual benchmarks for all the included changes, and there is no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite?
      
      This is an example benchmark I did:
      
```
test("benchmark") {
  val builder = new EdgePartitionBuilder[Int]
  for (i <- (1 to 10000000)) {
    builder.add(i.toLong, i.toLong, i)
  }
  val p = builder.toEdgePartition
  p.map(_.attr + 1).iterator.toList
}
```
      
      It ran for 10 seconds both before and after this change.
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #276 from darabos/spark-1188 and squashes the following commits:
      
      574302b [Daniel Darabos] Restore "manual" copying in EdgePartition.map(Iterator). Add comment to discourage novices like myself from trying to simplify the code.
      4117a64 [Daniel Darabos] Revert EdgePartitionSuite.
      4955697 [Daniel Darabos] Create a copy of the Edge objects in EdgeRDD.compute(). This avoids exposing the object re-use, while still enables the more efficient behavior for internal code.
      4ec77f8 [Daniel Darabos] Add comments about object re-use to the affected functions.
      2da5e87 [Daniel Darabos] Restore object re-use in EdgePartition.
      0182f2b [Daniel Darabos] Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code.
      c55f52f [Daniel Darabos] Tests that reproduce the problems from SPARK-1188.
      78236334
    • Andrew Or's avatar
      [SPARK-1385] Use existing code for JSON de/serialization of BlockId · de8eefa8
      Andrew Or authored
      `BlockId.scala` offers a way to reconstruct a BlockId from a string through regex matching. `util/JsonProtocol.scala` duplicates this functionality by explicitly matching on the BlockId type.
      With this PR, the de/serialization of BlockIds will go through the first (older) code path.
      
      (Most of the line changes in this PR involve changing `==` to `===` in `JsonProtocolSuite.scala`)
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #289 from andrewor14/blockid-json and squashes the following commits:
      
      409d226 [Andrew Or] Simplify JSON de/serialization for BlockId
      de8eefa8
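The older code path being reused reconstructs a BlockId from its string form via regex matching. A sketch of the idea with an illustrative pattern (the real format strings live in `BlockId.scala`):

```python
import re

# Illustrative pattern for one block-id family, e.g. "rdd_<rddId>_<splitIndex>".
RDD_BLOCK = re.compile(r"rdd_(\d+)_(\d+)$")

def parse_block_id(s):
    # Round-trip the structured id from its string form, roughly what the
    # regex matching in BlockId.scala does for each block-id type.
    m = RDD_BLOCK.match(s)
    if m:
        return ("rdd", int(m.group(1)), int(m.group(2)))
    raise ValueError("unrecognized block id: " + s)

parse_block_id("rdd_2_5")  # ('rdd', 2, 5)
```

Routing JSON de/serialization through this single parser avoids the duplicated per-type matching that `JsonProtocol.scala` previously carried.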
    • Kay Ousterhout's avatar
      Renamed stageIdToActiveJob to jobIdToActiveJob. · 11973a7b
      Kay Ousterhout authored
This data structure was misused and, as a result, was later given an incorrect name.

This data structure seems to have gotten into this tangled state when @henrydavidge used the stage ID instead of the job ID to index into it, and @andrewor14 later renamed the data structure to reflect this misunderstanding.

This patch renames it and removes an incorrect indexing into it.  The incorrect indexing meant that the code added by @henrydavidge to warn when a task size is too large (added here https://github.com/apache/spark/commit/57579934f0454f258615c10e69ac2adafc5b9835) was not always executed; this commit fixes that.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #301 from kayousterhout/fixCancellation and squashes the following commits:
      
      bd3d3a4 [Kay Ousterhout] Renamed stageIdToActiveJob to jobIdToActiveJob.
      11973a7b
    • Michael Armbrust's avatar
      Remove * from test case golden filename. · ea9de658
      Michael Armbrust authored
      @rxin mentioned this might cause issues on windows machines.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #297 from marmbrus/noStars and squashes the following commits:
      
      263122a [Michael Armbrust] Remove * from test case golden filename.
      ea9de658
  4. Apr 01, 2014
    • Manish Amde's avatar
      MLI-1 Decision Trees · 8b3045ce
      Manish Amde authored
      Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.
      
      Key features:
      + Supports binary classification and regression
      + Supports gini, entropy and variance for information gain calculation
      + Supports both continuous and categorical features
      
      The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include:
      
      1. Level-wise training to reduce passes over the entire dataset.
      2. Bin-wise split calculation to reduce computation overhead.
      3. Aggregation over partitions before combining to reduce communication overhead.
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #79 from manishamde/tree and squashes the following commits:
      
      1e8c704 [Manish Amde] remove numBins field in the Strategy class
      7d54b4f [manishamde] Merge pull request #4 from mengxr/dtree
      f536ae9 [Xiangrui Meng] another pass on code style
      e1dd86f [Manish Amde] implementing code style suggestions
      62dc723 [Manish Amde] updating javadoc and converting helper methods to package private to allow unit testing
      201702f [Manish Amde] making some more methods private
      f963ef5 [Manish Amde] making methods private
      c487e6a [manishamde] Merge pull request #1 from mengxr/dtree
      24500c5 [Xiangrui Meng] minor style updates
      4576b64 [Manish Amde] documentation and for to while loop conversion
      ff363a7 [Manish Amde] binary search for bins and while loop for categorical feature bins
      632818f [Manish Amde] removing threshold for classification predict method
      2116360 [Manish Amde] removing dummy bin calculation for categorical variables
      6068356 [Manish Amde] ensuring num bins is always greater than max number of categories
      62c2562 [Manish Amde] fixing comment indentation
      ad1fc21 [Manish Amde] incorporated mengxr's code style suggestions
      d1ef4f6 [Manish Amde] more documentation
      794ff4d [Manish Amde] minor improvements to docs and style
      eb8fcbe [Manish Amde] minor code style updates
      cd2c2b4 [Manish Amde] fixing code style based on feedback
      63e786b [Manish Amde] added multiple train methods for java compatability
      d3023b3 [Manish Amde] adding more docs for nested methods
      84f85d6 [Manish Amde] code documentation
      9372779 [Manish Amde] code style: max line lenght <= 100
      dd0c0d7 [Manish Amde] minor: some docs
      0dd7659 [manishamde] basic doc
      5841c28 [Manish Amde] unit tests for categorical features
      f067d68 [Manish Amde] minor cleanup
      c0e522b [Manish Amde] updated predict and split threshold logic
      b09dc98 [Manish Amde] minor refactoring
      6b7de78 [Manish Amde] minor refactoring and tests
      d504eb1 [Manish Amde] more tests for categorical features
      dbb7ac1 [Manish Amde] categorical feature support
      6df35b9 [Manish Amde] regression predict logic
      53108ed [Manish Amde] fixing index for highest bin
      e23c2e5 [Manish Amde] added regression support
      c8f6d60 [Manish Amde] adding enum for feature type
      b0e3e76 [Manish Amde] adding enum for feature type
      154aa77 [Manish Amde] enums for configurations
      733d6dd [Manish Amde] fixed tests
      02c595c [Manish Amde] added command line parsing
      98ec8d5 [Manish Amde] tree building and prediction logic
      b0eb866 [Manish Amde] added logic to handle leaf nodes
      80e8c66 [Manish Amde] working version of multi-level split calculation
      4798aae [Manish Amde] added gain stats class
      dad0afc [Manish Amde] decison stump functionality working
      03f534c [Manish Amde] some more tests
      0012a77 [Manish Amde] basic stump working
      8bca1e2 [Manish Amde] additional code for creating intermediate RDD
      92cedce [Manish Amde] basic building blocks for intermediate RDD calculation. untested.
      cd53eae [Manish Amde] skeletal framework
      8b3045ce
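The gini and entropy criteria listed under information gain can be written down directly for a binary class distribution (illustrative formulas, not MLlib's implementation):

```python
import math

def gini(p):
    # Gini impurity of a binary distribution with positive-class probability p.
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def entropy(p):
    # Shannon entropy in bits; pure nodes (p = 0 or 1) have zero entropy.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

gini(0.5)     # 0.5
entropy(0.5)  # 1.0
```

A split's information gain is the parent node's impurity minus the weighted impurity of its children; the tree picks the candidate split that maximizes it.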
    • Diana Carroll's avatar
      [Spark-1134] only call ipython if no arguments are given; remove IPYTHONOPTS from call · afb5ea62
      Diana Carroll authored
See comments on pull request https://github.com/apache/spark/pull/38.
(I couldn't figure out how to modify an existing pull request, so I'm hoping I can withdraw that one and replace it with this one.)
      
      Author: Diana Carroll <dcarroll@cloudera.com>
      
      Closes #227 from dianacarroll/spark-1134 and squashes the following commits:
      
      ffe47f2 [Diana Carroll] [spark-1134] remove ipythonopts from ipython command
      b673bf7 [Diana Carroll] Merge branch 'master' of github.com:apache/spark
      0309cf9 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
      afb5ea62
    • Mark Hamstra's avatar
      [SPARK-1342] Scala 2.10.4 · 764353d2
      Mark Hamstra authored
      Just a Scala version increment
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #259 from markhamstra/scala-2.10.4 and squashes the following commits:
      
      fbec547 [Mark Hamstra] [SPARK-1342] Bumped Scala version to 2.10.4
      764353d2
    • Michael Armbrust's avatar
      [SQL] SPARK-1372 Support for caching and uncaching tables in a SQLContext. · f5c418da
      Michael Armbrust authored
This doesn't yet support different databases in Hive (though you can probably work around this by calling `USE <dbname>`).  However, given the time constraints for 1.0, I think it's probably worth including this now and extending the functionality in the next release.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #282 from marmbrus/cacheTables and squashes the following commits:
      
      83785db [Michael Armbrust] Support for caching and uncaching tables in a SQLContext.
      f5c418da
    • Andrew Or's avatar
      [Hot Fix #42] Persisted RDD disappears on storage page if re-used · ada310a9
      Andrew Or authored
      If a previously persisted RDD is re-used, its information disappears from the Storage page.
      
      This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing information regarding that RDD with a fresh one, whether or not the information for the RDD already exists.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #281 from andrewor14/ui-storage-fix and squashes the following commits:
      
      408585a [Andrew Or] Fix storage UI bug
      ada310a9
  5. Mar 31, 2014
    • Andrew Or's avatar
      [SPARK-1377] Upgrade Jetty to 8.1.14v20131031 · 94fe7fd4
      Andrew Or authored
      Previous version was 7.6.8v20121106. The only difference between Jetty 7 and Jetty 8 is that the former uses Servlet API 2.5, while the latter uses Servlet API 3.0.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #280 from andrewor14/jetty-upgrade and squashes the following commits:
      
      dd57104 [Andrew Or] Merge github.com:apache/spark into jetty-upgrade
      e75fa85 [Andrew Or] Upgrade Jetty to 8.1.14v20131031
      94fe7fd4
    • Sandy Ryza's avatar
      SPARK-1376. In the yarn-cluster submitter, rename "args" option to "arg" · 564f1c13
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #279 from sryza/sandy-spark-1376 and squashes the following commits:
      
      d8aebfa [Sandy Ryza] SPARK-1376. In the yarn-cluster submitter, rename "args" option to "arg"
      564f1c13
    • Patrick Wendell's avatar
      SPARK-1365 [HOTFIX] Fix RateLimitedOutputStream test · 33b3c2a8
      Patrick Wendell authored
      This test needs to be fixed. It currently depends on Thread.sleep() having exact-timing
      semantics, which is not a valid assumption.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #277 from pwendell/rate-limited-stream and squashes the following commits:
      
      6c0ff81 [Patrick Wendell] SPARK-1365: Fix RateLimitedOutputStream test
      33b3c2a8
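The invalid assumption is easy to demonstrate in any language (a generic illustration, unrelated to the Scala test itself): `sleep` promises a *minimum* delay, not an exact one, so only lower-bound assertions on elapsed time are safe.

```python
import time

start = time.monotonic()
time.sleep(0.05)
elapsed = time.monotonic() - start

# A lower bound always holds; asserting exact elapsed time (or a tight
# upper bound) fails under scheduler jitter, which makes tests flaky.
print(elapsed >= 0.05)
```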
    • Michael Armbrust's avatar
      [SQL] Rewrite join implementation to allow streaming of one relation. · 5731af5b
      Michael Armbrust authored
Before, we were materializing everything in memory.  This also uses the projection interface, so it will be easier to plug in code gen (it's ported from that branch).
      
      @rxin @liancheng
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #250 from marmbrus/hashJoin and squashes the following commits:
      
      1ad873e [Michael Armbrust] Change hasNext logic back to the correct version.
      8e6f2a2 [Michael Armbrust] Review comments.
      1e9fb63 [Michael Armbrust] style
      bc0cb84 [Michael Armbrust] Rewrite join implementation to allow streaming of one relation.
      5731af5b
    • Patrick Wendell's avatar
      SPARK-1352: Improve robustness of spark-submit script · 841721e0
      Patrick Wendell authored
      1. Better error messages when required arguments are missing.
      2. Support for unit testing cases where presented arguments are invalid.
3. Bug fix: only use environment variables when they are set (otherwise they cause an NPE).
      4. A verbose mode to aid debugging.
      5. Visibility of several variables is set to private.
      6. Deprecation warning for existing scripts.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #271 from pwendell/spark-submit and squashes the following commits:
      
      9146def [Patrick Wendell] SPARK-1352: Improve robustness of spark-submit script
      841721e0
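Item 3 (guarding against unset environment variables) follows a common shell pattern; a sketch with an illustrative variable name, not the actual spark-submit code:

```shell
# Only build the flag when the variable is actually set and non-empty;
# blindly dereferencing an unset variable is what caused the failure.
MASTER_ARGS=""
if [ -n "${SPARK_MASTER_URL:-}" ]; then
  MASTER_ARGS="--master ${SPARK_MASTER_URL}"
fi
echo "extra args: ${MASTER_ARGS}"
```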
  6. Mar 30, 2014
  7. Mar 29, 2014
    • Bernardo Gomez Palacio's avatar
      [SPARK-1186] : Enrich the Spark Shell to support additional arguments. · fda86d8b
      Bernardo Gomez Palacio authored
      Enrich the Spark Shell functionality to support the following options.
      
      ```
      Usage: spark-shell [OPTIONS]
      
      OPTIONS:
          -h  --help              : Print this help information.
          -c  --cores             : The maximum number of cores to be used by the Spark Shell.
          -em --executor-memory   : The memory used by each executor of the Spark Shell, the number
                                    is followed by m for megabytes or g for gigabytes, e.g. "1g".
          -dm --driver-memory     : The memory used by the Spark Shell, the number is followed
                                    by m for megabytes or g for gigabytes, e.g. "1g".
          -m  --master            : A full string that describes the Spark Master, defaults to "local"
                                    e.g. "spark://localhost:7077".
          --log-conf              : Enables logging of the supplied SparkConf as INFO at start of the
                                    Spark Context.
      
      e.g.
          spark-shell -m spark://localhost:7077 -c 4 -dm 512m -em 2g
      ```
      
      **Note**: this commit reflects the changes applied to _master_ based on [5d98cfc1].
      
      [ticket: SPARK-1186] : Enrich the Spark Shell to support additional arguments.
                              https://spark-project.atlassian.net/browse/SPARK-1186
      
      
      Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
      
      Closes #116 from berngp/feature/enrich-spark-shell and squashes the following commits:
      
      c5f455f [Bernardo Gomez Palacio] [SPARK-1186] : Enrich the Spark Shell to support additional arguments.
      fda86d8b
    • Cheng Hao's avatar
      Implement the RLike & Like in catalyst · af3746ce
      Cheng Hao authored
      This PR includes:
      1) Unify the unit test for expression evaluation
      2) Add implementation of RLike & Like
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #224 from chenghao-intel/string_expression and squashes the following commits:
      
      84f72e9 [Cheng Hao] fix bug in RLike/Like & Simplify the unit test
      aeeb1d7 [Cheng Hao] Simplify the implementation/unit test of RLike/Like
      319edb7 [Cheng Hao] change to spark code style
      91cfd33 [Cheng Hao] add implementation for rlike/like
      2c8929e [Cheng Hao] Update the unit test for expression evaluation
      af3746ce
    • Sandy Ryza's avatar
      SPARK-1126. spark-app preliminary · 16178160
      Sandy Ryza authored
      This is a starting version of the spark-app script for running compiled binaries against Spark.  It still needs tests and some polish.  The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.
      
      This leaves out the changes required for launching python scripts.  I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes).
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #86 from sryza/sandy-spark-1126 and squashes the following commits:
      
      d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
      e7315c6 [Sandy Ryza] Fix failing tests
      34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
      299ddca [Sandy Ryza] Fix scalastyle
      a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
      04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
      16178160
    • Thomas Graves's avatar
SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files · 3738f244
Thomas Graves authored
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #263 from tgravescs/SPARK-1345 and squashes the following commits:
      
      b43a2a0 [Thomas Graves] SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files
      3738f244
  8. Mar 28, 2014
    • Nick Lanham's avatar
      fix path for jar, make sed actually work on OSX · 75d46be5
      Nick Lanham authored
      Author: Nick Lanham <nick@afternight.org>
      
      Closes #264 from nicklan/make-distribution-fixes and squashes the following commits:
      
      172b981 [Nick Lanham] fix path for jar, make sed actually work on OSX
      75d46be5