  1. Mar 29, 2014
    • [SPARK-1186] : Enrich the Spark Shell to support additional arguments. · fda86d8b
      Bernardo Gomez Palacio authored
      Enrich the Spark Shell functionality to support the following options.
      
      ```
      Usage: spark-shell [OPTIONS]
      
      OPTIONS:
          -h  --help              : Print this help information.
          -c  --cores             : The maximum number of cores to be used by the Spark Shell.
          -em --executor-memory   : The memory used by each executor of the Spark Shell, the number
                                    is followed by m for megabytes or g for gigabytes, e.g. "1g".
          -dm --driver-memory     : The memory used by the Spark Shell, the number is followed
                                    by m for megabytes or g for gigabytes, e.g. "1g".
          -m  --master            : A full string that describes the Spark Master, defaults to "local"
                                    e.g. "spark://localhost:7077".
          --log-conf              : Enables logging of the supplied SparkConf as INFO at start of the
                                    Spark Context.
      
      e.g.
          spark-shell -m spark://localhost:7077 -c 4 -dm 512m -em 2g
      ```
      
      **Note**: this commit reflects the changes applied to _master_ based on [5d98cfc1].
      
      [ticket: SPARK-1186] : Enrich the Spark Shell to support additional arguments.
                              https://spark-project.atlassian.net/browse/SPARK-1186
      
      Author      : bernardo.gomezpalcio@gmail.com
      
      Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
      
      Closes #116 from berngp/feature/enrich-spark-shell and squashes the following commits:
      
      c5f455f [Bernardo Gomez Palacio] [SPARK-1186] : Enrich the Spark Shell to support additional arguments.
      fda86d8b
    • Implement the RLike & Like in catalyst · af3746ce
      Cheng Hao authored
      This PR includes:
      1) Unify the unit test for expression evaluation
      2) Add implementation of RLike & Like
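      As a rough illustration of what a LIKE implementation involves (plain Python, not the Catalyst code), SQL LIKE evaluation is commonly done by translating the pattern into a regular expression, where `%` matches any sequence and `_` matches any single character:

      ```python
      import re

      def sql_like(value, pattern):
          # Translate the SQL LIKE pattern into a regex, escaping literal
          # characters so e.g. "." in the pattern is not treated as a wildcard.
          regex = "".join(
              ".*" if c == "%" else "." if c == "_" else re.escape(c)
              for c in pattern
          )
          return re.fullmatch(regex, value) is not None

      print(sql_like("SparkSQL", "Spark%"))
      ```

      RLIKE, by contrast, applies the pattern directly as a regular expression with no translation step.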
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #224 from chenghao-intel/string_expression and squashes the following commits:
      
      84f72e9 [Cheng Hao] fix bug in RLike/Like & Simplify the unit test
      aeeb1d7 [Cheng Hao] Simplify the implementation/unit test of RLike/Like
      319edb7 [Cheng Hao] change to spark code style
      91cfd33 [Cheng Hao] add implementation for rlike/like
      2c8929e [Cheng Hao] Update the unit test for expression evaluation
      af3746ce
    • SPARK-1126. spark-app preliminary · 16178160
      Sandy Ryza authored
      This is a starting version of the spark-app script for running compiled binaries against Spark.  It still needs tests and some polish.  The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.
      
      This leaves out the changes required for launching python scripts.  I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes).
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #86 from sryza/sandy-spark-1126 and squashes the following commits:
      
      d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
      e7315c6 [Sandy Ryza] Fix failing tests
      34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
      299ddca [Sandy Ryza] Fix scalastyle
      a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
      04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
      16178160
    • SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files · 3738f244
      Thomas Graves authored
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #263 from tgravescs/SPARK-1345 and squashes the following commits:
      
      b43a2a0 [Thomas Graves] SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files
      3738f244
  2. Mar 28, 2014
    • fix path for jar, make sed actually work on OSX · 75d46be5
      Nick Lanham authored
      Author: Nick Lanham <nick@afternight.org>
      
      Closes #264 from nicklan/make-distribution-fixes and squashes the following commits:
      
      172b981 [Nick Lanham] fix path for jar, make sed actually work on OSX
      75d46be5
    • SPARK-1096, a space after comment start style checker. · 60abc252
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits:
      
      214135a [Prashant Sharma] Review feedback.
      5eba88c [Prashant Sharma] Fixed style checks for ///+ comments.
      e54b2f8 [Prashant Sharma] improved message, work around.
      83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since scalastyle sbt plugin already depends on the right version. Incase we update the plugin we will have to adjust our spark-style project to depend on right scalastyle version.
      810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker.
      ba33193 [Prashant Sharma] scala style as a project
      60abc252
    • Make sed do -i '' on OSX · 632c3220
      Nick Lanham authored
      I don't have access to an OSX machine, so if someone could test this that would be great.
      
      Author: Nick Lanham <nick@afternight.org>
      
      Closes #258 from nicklan/osx-sed-fix and squashes the following commits:
      
      a6f158f [Nick Lanham] Also make mktemp work on OSX
      558fd6e [Nick Lanham] Make sed do -i '' on OSX
      632c3220
    • [SPARK-1210] Prevent ContextClassLoader of Actor from becoming ClassLoader of Executor. · 3d89043b
      Takuya UESHIN authored
      
      The constructor of `org.apache.spark.executor.Executor` should not set the context class loader of the current thread, which is the backend Actor's thread.
      
      Run the following code in local-mode REPL.
      
      ```
      scala> case class Foo(i: Int)
      scala> val ret = sc.parallelize((1 to 100).map(Foo), 10).collect
      ```
      
      This causes errors as follows:
      
      ```
      ERROR actor.OneForOneStrategy: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo;
      java.lang.ArrayStoreException: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo;
           at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
           at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870)
           at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870)
           at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
           at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:859)
           at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:616)
           at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
           at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
           at akka.actor.ActorCell.invoke(ActorCell.scala:456)
           at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
           at akka.dispatch.Mailbox.run(Mailbox.scala:219)
           at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
           at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
           at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
           at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
           at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      ```
      
      This happens because the class loader used to deserialize the resulting `Foo` instances may differ from the backend Actor's class loader, while the Actor's class loader should be the same as the Driver's.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15 from ueshin/wip/wrongcontextclassloader and squashes the following commits:
      
      d79e8c0 [Takuya UESHIN] Change a parent class loader of ExecutorURLClassLoader.
      c6c09b6 [Takuya UESHIN] Add a test to collect objects of class defined in repl.
      43e0feb [Takuya UESHIN] Prevent ContextClassLoader of Actor from becoming ClassLoader of Executor.
      3d89043b
  3. Mar 27, 2014
    • [SPARK-1268] Adding XOR and AND-NOT operations to spark.util.collection.BitSet · 6f986f0b
      Petko Nikolov authored
      Symmetric difference (xor) in particular is useful for computing some distance metrics (e.g. Hamming). Unit tests added.
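      The Hamming-distance use case can be sketched in plain Python, with ints standing in for spark.util.collection.BitSet (illustrative only, not the Scala implementation):

      ```python
      # Two small bit sets encoded as Python ints.
      a = 0b101101
      b = 0b100111

      xor = a ^ b        # symmetric difference: bits set in exactly one of the two
      and_not = a & ~b   # bits set in a but not in b

      # Hamming distance = popcount of the symmetric difference.
      hamming = bin(xor).count("1")
      print(hamming)
      ```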
      
      Author: Petko Nikolov <nikolov@soundcloud.com>
      
      Closes #172 from petko-nikolov/bitset-imprv and squashes the following commits:
      
      451f28b [Petko Nikolov] fixed style mistakes
      5beba18 [Petko Nikolov] rm outer loop in andNot test
      0e61035 [Petko Nikolov] conform to spark style; rm redundant asserts; more unit tests added; use arraycopy instead of loop
      d53cdb9 [Petko Nikolov] rm incidentally added space
      4e1df43 [Petko Nikolov] adding xor and and-not to BitSet; unit tests added
      6f986f0b
    • SPARK-1335. Also increase perm gen / code cache for scalatest when invoked via Maven build · 53953d09
      Sean Owen authored
      I am observing build failures when the Maven build reaches tests in the new SQL components. (I'm on Java 7 / OSX 10.9.) The failure is the usual complaint from scala: it's out of permgen space, or the JIT is out of code cache space.
      
      I see that various build scripts increase these both for SBT. This change simply adds these settings to scalatest's arguments. Works for me and seems a bit more consistent.
      
      (I also snuck in cures for new build warnings from new scaladoc. Felt too trivial for a new PR, although it's separate. Just something I also saw while examining the build output.)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #253 from srowen/SPARK-1335 and squashes the following commits:
      
      c0f2d31 [Sean Owen] Appease scalastyle with a newline at the end of the file
      a02679c [Sean Owen] Fix scaladoc errors due to missing links, which are generating build warnings, from some recent doc changes. We apparently can't generate links outside the module.
      b2c6a09 [Sean Owen] Add perm gen, code cache settings to scalatest, mirroring SBT settings elsewhere, which allows tests to complete in at least one environment where they are failing. (Also removed a duplicate -Xms setting elsewhere.)
      53953d09
    • SPARK-1330 removed extra echo from comput_classpath.sh · 426042ad
      Thomas Graves authored
      Remove the extra echo, which prevents spark-class from working. Note that I did not update the comment above it (which is also wrong), because I'm not sure what it should do.
      
      Should hive only be included if explicitly built with sbt hive/assembly or should sbt assembly build it?
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #241 from tgravescs/SPARK-1330 and squashes the following commits:
      
      b10d708 [Thomas Graves] SPARK-1330 removed extra echo from comput_classpath.sh
      426042ad
    • Cut down the granularity of travis tests. · 5b2d863e
      Michael Armbrust authored
      This PR amortizes the cost of downloading all the jars and compiling core across more test cases.  In one anecdotal run this change takes the cumulative time down from ~80 minutes to ~40 minutes.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #255 from marmbrus/travis and squashes the following commits:
      
      506b22d [Michael Armbrust] Cut down the granularity of travis tests so we can amortize the cost of compilation.
      5b2d863e
  4. Mar 26, 2014
    • [SPARK-1327] GLM needs to check addIntercept for intercept and weights · d679843a
      Xiangrui Meng authored
      GLM needs to check addIntercept for intercept and weights. The current implementation always uses the first weight as intercept. Added a test for training without adding intercept.
      
      JIRA: https://spark-project.atlassian.net/browse/SPARK-1327
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #236 from mengxr/glm and squashes the following commits:
      
      bcac1ac [Xiangrui Meng] add two tests to ensure {Lasso, Ridge}.setIntercept will throw an exceptions
      a104072 [Xiangrui Meng] remove protected to be compatible with 0.9
      0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
      d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
      d679843a
    • SPARK-1325. The maven build error for Spark Tools · 1fa48d94
      Sean Owen authored
      This is just a slight variation on https://github.com/apache/spark/pull/234 and alternative suggestion for SPARK-1325. `scala-actors` is not necessary. `SparkBuild.scala` should be updated to reflect the direct dependency on `scala-reflect` and `scala-compiler`. And the `repl` build, which has the same dependencies, should also be consistent between Maven / SBT.
      
      Author: Sean Owen <sowen@cloudera.com>
      Author: witgo <witgo@qq.com>
      
      Closes #240 from srowen/SPARK-1325 and squashes the following commits:
      
      25bd7db [Sean Owen] Add necessary dependencies scala-reflect and scala-compiler to tools. Update repl dependencies, which are similar, to be consistent between Maven / SBT in this regard too.
      1fa48d94
    • Spark 1095 : Adding explicit return types to all public methods · 3e63d98f
      NirmalReddy authored
      Excluded those that are self-evident and the cases that are discussed in the mailing list.
      
      Author: NirmalReddy <nirmal_reddy2000@yahoo.com>
      Author: NirmalReddy <nirmal.reddy@imaginea.com>
      
      Closes #168 from NirmalReddy/Spark-1095 and squashes the following commits:
      
      ac54b29 [NirmalReddy] import misplaced
      8c5ff3e [NirmalReddy] Changed syntax of unit returning methods
      02d0778 [NirmalReddy] fixed explicit types in all the other packages
      1c17773 [NirmalReddy] fixed explicit types in core package
      3e63d98f
    • SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS · be6d96c1
      Patrick Wendell authored
      /cc @aarondav and @andrewor14
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #231 from pwendell/ui-binding and squashes the following commits:
      
      e8025f8 [Patrick Wendell] SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS
      be6d96c1
    • [SQL] Add a custom serializer for maps since they do not have a no-arg constructor. · e15e5741
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #243 from marmbrus/mapSer and squashes the following commits:
      
      54045f7 [Michael Armbrust] Add a custom serializer for maps since they do not have a no-arg constructor.
      e15e5741
    • [SQL] Un-ignore a test that is now passing. · 32cbdfd2
      Michael Armbrust authored
      Add golden answer for aforementioned test.
      
      Also, fix golden test generation from sbt/sbt by setting the classpath correctly.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #244 from marmbrus/partTest and squashes the following commits:
      
      37a33c9 [Michael Armbrust] Un-ignore a test that is now passing, add golden answer for aforementioned test.  Fix golden test generation from sbt/sbt.
      32cbdfd2
    • Unified package definition format in Spark SQL · 345825d9
      Cheng Lian authored
      According to discussions in comments of PR #208, this PR unifies package definition format in Spark SQL.
      
      Some broken links in ScalaDoc and typos detected along the way are also fixed.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #225 from liancheng/packageDefinition and squashes the following commits:
      
      75c47b3 [Cheng Lian] Fixed file line length
      4f87968 [Cheng Lian] Unified package definition format in Spark SQL
      345825d9
    • SPARK-1322, top in pyspark should sort result in descending order. · a0853a39
      Prashant Sharma authored
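      The fixed behavior can be modeled in plain Python, with `heapq.nlargest` standing in for pyspark's `top` (an illustration only, not the pyspark code): the k largest elements come back largest-first.

      ```python
      import heapq

      data = [3, 1, 4, 1, 5, 9, 2, 6]
      # Like the fixed rdd.top(3): the three largest elements, descending.
      top3 = heapq.nlargest(3, data)
      print(top3)
      ```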
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #235 from ScrapCodes/SPARK-1322/top-rev-sort and squashes the following commits:
      
      f316266 [Prashant Sharma] Minor change in comment.
      58e58c6 [Prashant Sharma] SPARK-1322, top in pyspark should sort result in descending order.
      a0853a39
    • SPARK-1321 Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation · b859853b
      Reynold Xin authored
      
      Also updated the documentation for top and takeOrdered.
      
      On my simple test of sorting 100 million (Int, Int) tuples using Spark, Guava's top k implementation (in Ordering) is much faster than the BoundedPriorityQueue implementation for roughly sorted input (10 - 20X faster), and still faster for purely random input (2 - 5X).
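      For reference, the bounded-priority-queue style of top-k being replaced can be sketched in Python (illustrative only, not Spark's implementation): keep a min-heap of size k and evict the smallest element whenever a larger one arrives.

      ```python
      import heapq

      def top_k(items, k):
          # Min-heap of at most k elements; heap[0] is the smallest survivor.
          heap = []
          for x in items:
              if len(heap) < k:
                  heapq.heappush(heap, x)
              elif x > heap[0]:
                  heapq.heapreplace(heap, x)  # evict smallest, insert x
          return sorted(heap, reverse=True)

      print(top_k([5, 1, 9, 3, 7], 3))
      ```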
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #229 from rxin/takeOrdered and squashes the following commits:
      
      0d11844 [Reynold Xin] Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation. Also updated the documentation for top and takeOrdered.
      b859853b
  5. Mar 25, 2014
    • Initial experimentation with Travis CI configuration · 4f7d547b
      Michael Armbrust authored
      This is not intended to replace Jenkins immediately, and Jenkins will remain the CI of reference for merging pull requests in the near term.  Long term, it is possible that Travis will give us better integration with github, so we are investigating its use.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #230 from marmbrus/travis and squashes the following commits:
      
      93f9a32 [Michael Armbrust] Add Apache license to .travis.yml
      d7c0e78 [Michael Armbrust] Initial experimentation with Travis CI configuration
      4f7d547b
    • Avoid Option while generating call site · 8237df80
      witgo authored
      This is an update on https://github.com/apache/spark/pull/180, which changes the solution from blacklisting "Option.scala" to avoiding the Option code path while generating the call path.
      
      Also includes a unit test to prevent this issue in the future, and some minor refactoring.
      
      Thanks @witgo for reporting this issue and working on the initial solution!
      
      Author: witgo <witgo@qq.com>
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #222 from aarondav/180 and squashes the following commits:
      
      f74aad1 [Aaron Davidson] Avoid Option while generating call site & add unit tests
      d2b4980 [witgo] Modify the position of the filter
      1bc22d7 [witgo] Fix Stage.name return "apply at Option.scala:120"
      8237df80
    • SPARK-1319: Fix scheduler to account for tasks using > 1 CPUs. · f8111eae
      Shivaram Venkataraman authored
      Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.
      
      Thanks @kayousterhout for the design discussion
      
      Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu>
      
      Closes #219 from shivaram/multi-cpus and squashes the following commits:
      
      5c7d685 [Shivaram Venkataraman] Don't pass availableCpus to TaskSetManager
      260e4d5 [Shivaram Venkataraman] Add a check for non-zero CPUs in TaskSetManager
      73fcf6f [Shivaram Venkataraman] Add documentation for spark.task.cpus
      647bc45 [Shivaram Venkataraman] Fix scheduler to account for tasks using > 1 CPUs. Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.
      f8111eae
    • SPARK-1316. Remove use of Commons IO · 71d4ed27
      Sean Owen authored
      (This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 )
      
      Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark `Utils.scala` class.
      
      Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #226 from srowen/SPARK-1316 and squashes the following commits:
      
      21efef3 [Sean Owen] Remove use of Commons IO
      71d4ed27
    • Add more hive compatibility tests to whitelist · 134ace7f
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #220 from marmbrus/moreTests and squashes the following commits:
      
      223ec35 [Michael Armbrust] Blacklist machine specific test
      9c966cc [Michael Armbrust] add more hive compatability tests to whitelist
      134ace7f
    • SPARK-1286: Make usage of spark-env.sh idempotent · 007a7334
      Aaron Davidson authored
      Various spark scripts load spark-env.sh. This can cause growth of any variables that may be appended to (SPARK_CLASSPATH, SPARK_REPL_OPTS) and it makes the precedence order for options specified in spark-env.sh less clear.
      
      One use-case for the latter is that we want to set options from the command-line of spark-shell, but these options will be overridden by subsequent loading of spark-env.sh. If we were to load the spark-env.sh first and then set our command-line options, we could guarantee correct precedence order.
      
      Note that we use SPARK_CONF_DIR if available to support the sbin/ scripts, which always set this variable from sbin/spark-config.sh. Otherwise, we default to the ../conf/ as usual.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #184 from aarondav/idem and squashes the following commits:
      
      e291f91 [Aaron Davidson] Use "private" variables in load-spark-env.sh
      8da8360 [Aaron Davidson] Add .sh extension to load-spark-env.sh
      93a2471 [Aaron Davidson] SPARK-1286: Make usage of spark-env.sh idempotent
      007a7334
    • Unify the logic for column pruning, projection, and filtering of table scans. · b637f2d9
      Michael Armbrust authored
      This removes duplicated logic, dead code and casting when planning parquet table scans and hive table scans.
      
      Other changes:
       - Fix tests now that we are doing a better job of column pruning (i.e., since pruning predicates are applied before we even start scanning tuples, columns required by these predicates do not need to be included in the output of the scan unless they are also included in the final output of this logical plan fragment).
       - Add rule to simplify trivial filters.  This was required to avoid `WHERE false` from getting pushed into table scans, since `HiveTableScan` (reasonably) refuses to apply partition pruning predicates to non-partitioned tables.
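      The trivial-filter rule can be modeled as a toy sketch (plain Python with hypothetical names, not the actual Catalyst rule): a filter with a constant predicate is either dropped entirely (true) or replaced by an empty result (false), so `WHERE false` never reaches a table scan.

      ```python
      def simplify_filter(predicate, child_rows):
          # Constant-true filter: the filter node is a no-op and can be elided.
          if predicate is True:
              return child_rows
          # Constant-false filter: replace the whole subtree with an empty result,
          # so the predicate is never pushed into the scan below.
          if predicate is False:
              return []
          # Otherwise evaluate the predicate per row as usual.
          return [r for r in child_rows if predicate(r)]

      print(simplify_filter(False, [1, 2, 3]))
      ```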
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #213 from marmbrus/strategyCleanup and squashes the following commits:
      
      48ce403 [Michael Armbrust] Move one more bit of parquet stuff into the core SQLContext.
      834ce08 [Michael Armbrust] Address comments.
      0f2c6f5 [Michael Armbrust] Unify the logic for column pruning, projection, and filtering of table scans for both Hive and Parquet relations.  Fix tests now that we are doing a better job of column pruning.
      b637f2d9
  6. Mar 24, 2014
    • SPARK-1128: set hadoop task properties when constructing HadoopRDD · 5140598d
      CodingCat authored
      https://spark-project.atlassian.net/browse/SPARK-1128
      
      The task properties are not set when constructing HadoopRDD in the current implementation; this may limit implementations that rely on
      
      ```
      mapred.tip.id
      mapred.task.id
      mapred.task.is.map
      mapred.task.partition
      mapred.job.id
      ```
      
      This patch also contains a small fix in createJobID (SparkHadoopWriter.scala), where the current implementation is not actually using the time parameter.
      
      Author: CodingCat <zhunansjtu@gmail.com>
      Author: Nan Zhu <CodingCat@users.noreply.github.com>
      
      Closes #101 from CodingCat/SPARK-1128 and squashes the following commits:
      
      ed0980f [CodingCat] make SparkHiveHadoopWriter belongs to spark package
      5b1ad7d [CodingCat] move SparkHiveHadoopWriter to org.apache.spark package
      258f92c [CodingCat] code cleanup
      af88939 [CodingCat] update the comments and permission of SparkHadoopWriter
      9bd1fe3 [CodingCat] move configuration for jobConf to HadoopRDD
      b7bdfa5 [Nan Zhu] style fix
      a3153a8 [Nan Zhu] style fix
      c3258d2 [CodingCat] set hadoop task properties while using InputFormat
      5140598d
    • SPARK-1094 Support MiMa for reporting binary compatibility across versions. · dc126f21
      Patrick Wendell authored
      This adds some changes on top of the initial work by @scrapcodes in #20:
      
      The goal here is to do automated checking of Spark commits to determine whether they break binary compatibility.
      
      1. Special case for inner classes of package-private objects.
      2. Made tools classes accessible when running `spark-class`.
      3. Made some declared types in MLLib more general.
      4. Various other improvements to exclude-generation script.
      5. In-code documentation.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Prashant Sharma <scrapcodes@gmail.com>
      
      Closes #207 from pwendell/mima and squashes the following commits:
      
      22ae267 [Patrick Wendell] New binary changes after upmerge
      6c2030d [Patrick Wendell] Merge remote-tracking branch 'apache/master' into mima
      3666cf1 [Patrick Wendell] Minor style change
      0e0f570 [Patrick Wendell] Small fix and removing directory listings
      647c547 [Patrick Wendell] Reveiw feedback.
      c39f3b5 [Patrick Wendell] Some enhancements to binary checking.
      4c771e0 [Prashant Sharma] Added a tool to generate mima excludes and also adapted build to pick automatically.
      b551519 [Prashant Sharma] adding a new exclude after rebasing with master
      651844c [Prashant Sharma] Support MiMa for reporting binary compatibility accross versions.
      dc126f21
    • SPARK-1294 Fix resolution of uppercase field names using a HiveContext. · 8043b7bc
      Michael Armbrust authored
      Fixing this bug required the following:
       - Creation of a new logical node that converts a schema to lowercase.
       - Generalization of the subquery eliding rule to also elide this new node
       - Fixing of several places where too tight assumptions were made on the types of `InsertIntoTable` children.
       - I also removed an API that was left in by accident that exposed catalyst data structures, and fixed the logic that pushes down filters into Hive table scans to correctly compare attribute references.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #202 from marmbrus/upperCaseFieldNames and squashes the following commits:
      
      15e5265 [Michael Armbrust] Support for resolving mixed case fields from a reflected schema using HiveQL.
      5aa5035 [Michael Armbrust] Remove API that exposes internal catalyst data structures.
      9d99cb6 [Michael Armbrust] Attributes should be compared using exprId, not TreeNode.id.
      8043b7bc
    • HOT FIX: Exclude test files from RAT · 56db8a2f
      Patrick Wendell authored
      56db8a2f
    • SPARK-1144 Added license and RAT to check licenses. · 21109fba
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #125 from ScrapCodes/rat-integration and squashes the following commits:
      
      64f7c7d [Prashant Sharma] added license headers.
      fcf28b1 [Prashant Sharma] Review feedback.
      c0648db [Prashant Sharma] SPARK-1144 Added license and RAT to check licenses.
      21109fba
  7. Mar 23, 2014
    • [SPARK-1212] Adding sparse data support and update KMeans · 80c29689
      Xiangrui Meng authored
      Continue our discussions from https://github.com/apache/incubator-spark/pull/575
      
      This PR is WIP because it depends on a SNAPSHOT version of breeze.
      
      Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.
      
      @fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.
      
      I'm going to update this PR to include:
      
      1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time.
      
      2. Some numbers about the performance.
      
      3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!
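      The fast distance trick from item 1 can be sketched in plain Python (illustrative; the real implementation operates on breeze vectors): with squared norms precomputed, each pairwise distance needs only a dot product, which is cheap when one side is sparse.

      ```python
      def sq_dist(a, b, a_norm_sq, b_norm_sq):
          # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, using precomputed norms.
          dot = sum(x * y for x, y in zip(a, b))
          return a_norm_sq + b_norm_sq - 2.0 * dot

      a = [1.0, 2.0, 2.0]
      b = [0.0, 2.0, 0.0]
      # Squared norms computed once per point, then reused across all pairs.
      an = sum(x * x for x in a)
      bn = sum(x * x for x in b)
      print(sq_dist(a, b, an, bn))
      ```

      As the commit notes, the expansion can lose precision when the two norms are large and nearly equal, which is why it is only used when it does not introduce too much numerical error.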
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #117 from mengxr/sparse-kmeans and squashes the following commits:
      
      67b368d [Xiangrui Meng] fix SparseVector.toArray
      5eda0de [Xiangrui Meng] update NOTICE
      67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
      1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
      9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
      226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
      238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
      b28ba2f [Xiangrui Meng] add toArray to Vector
      e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
      72bde33 [Xiangrui Meng] clean up code for distance computation
      712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
      27858e4 [Xiangrui Meng] update breeze version to 0.7
      07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
      6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
      42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
      d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
      42b4e50 [Xiangrui Meng] line feed at the end
      a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
      3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
      0107e19 [Xiangrui Meng] update NOTICE
      87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
      0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
      f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
      ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
      4e7d5ca [Xiangrui Meng] minor style update
      07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
      80c29689
    • Cheng Lian's avatar
      Fixed coding style issues in Spark SQL · 8265dc77
      Cheng Lian authored
      This PR addresses various coding style issues in Spark SQL, including but not limited to those mentioned by @mateiz in PR #146.
      
      As this PR affects lots of source files and may cause potential conflicts, it would be better to merge this as soon as possible *after* PR #205 (In-memory columnar representation for Spark SQL) is merged.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #208 from liancheng/fixCodingStyle and squashes the following commits:
      
      fc2b528 [Cheng Lian] Merge branch 'master' into fixCodingStyle
      b531273 [Cheng Lian] Fixed coding style issues in sql/hive
      0b56f77 [Cheng Lian] Fixed coding style issues in sql/core
      fae7b02 [Cheng Lian] Addressed styling issues mentioned by @marmbrus
      9265366 [Cheng Lian] Fixed coding style issues in sql/core
      3dcbbbd [Cheng Lian] Fixed relative package imports for package catalyst
      8265dc77
    • Cheng Lian's avatar
      [SPARK-1292] In-memory columnar representation for Spark SQL · 57a4379c
      Cheng Lian authored
      This PR is rebased from the Catalyst repository, and contains the first version of in-memory columnar representation for Spark SQL. Compression support is not included yet and will be added later in a separate PR.
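      The core idea, storing each column contiguously instead of one object per row, can be sketched as follows. This is an illustrative Python sketch only; the class and method names are invented and do not match the PR's actual `ColumnBuilder`/`ColumnAccessor` classes:

      ```python
      class ColumnarTable:
          """Minimal in-memory columnar store: rows are split across
          per-column buffers on insert, so a scan over one column never
          touches the others."""

          def __init__(self, column_names):
              self.column_names = column_names
              # One contiguous list per column instead of one object per row.
              self.columns = {name: [] for name in column_names}

          def append_row(self, row):
              # Distribute each field of the incoming row to its column buffer.
              for name, value in zip(self.column_names, row):
                  self.columns[name].append(value)

          def scan(self, name):
              # A single-column scan reads only that column's values, which is
              # the main benefit of columnar layout for analytical queries.
              return self.columns[name]

      table = ColumnarTable(["id", "score"])
      table.append_row((1, 0.5))
      table.append_row((2, 0.9))
      print(table.scan("score"))  # → [0.5, 0.9]
      ```

      Columnar layout also makes per-column compression straightforward to add later, which is why compression can follow in a separate PR.
      
      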
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #205 from liancheng/memColumnarSupport and squashes the following commits:
      
      99dba41 [Cheng Lian] Restricted new objects/classes to `private[sql]'
      0892ad8 [Cheng Lian] Addressed ScalaStyle issues
      af1ad5e [Cheng Lian] Fixed some minor issues introduced during rebasing
      0dbf2fb [Cheng Lian] Make necessary renaming due to rebase
      a162d4d [Cheng Lian] Removed the unnecessary InMemoryColumnarRelation class
      9bcae4b [Cheng Lian] Added Apache license
      220ee1e [Cheng Lian] Added table scan operator for in-memory columnar support.
      c701c7a [Cheng Lian] Using SparkSqlSerializer for generic object SerDe causes error, made a workaround
      ed8608e [Cheng Lian] Added implicit conversion from DataType to ColumnType
      b8a645a [Cheng Lian] Replaced KryoSerializer with an updated SparkSqlSerializer
      b6c0a49 [Cheng Lian] Minor test suite refactoring
      214be73 [Cheng Lian] Refactored BINARY and GENERIC to reduce duplicate code
      da2f4d5 [Cheng Lian] Added Apache license
      dbf7a38 [Cheng Lian] Added ColumnAccessor and test suite, refactored ColumnBuilder
      c01a177 [Cheng Lian] Added column builder classes and test suite
      f18ddc6 [Cheng Lian] Added ColumnTypes and test suite
      2d09066 [Cheng Lian] Added KryoSerializer
      34f3c19 [Cheng Lian] Added TypeTag field to all NativeTypes
      acc5c48 [Cheng Lian] Added Hive test files to .gitignore
      57a4379c
    • Sean Owen's avatar
      SPARK-1254. Supplemental fix for HTTPS on Maven Central · abf6714e
      Sean Owen authored
      It seems that HTTPS does not necessarily work on Maven Central, as it does not today at least. Back to HTTP. Both builds work from a clean repo.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #209 from srowen/SPARK-1254Fix and squashes the following commits:
      
      bb7be47 [Sean Owen] Revert to HTTP for Maven Central repo, as it seems HTTPS does not necessarily work
      abf6714e
  8. Mar 21, 2014