  1. Jul 11, 2014
• [Minor] Remove unused val in Master · f4f46dec
      Andrew Or authored
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1365 from andrewor14/master-fs and squashes the following commits:
      
      497f100 [Andrew Or] Sneak in a space and hope no one will notice
      05ba6da [Andrew Or] Remove unused val
      f4f46dec
• fix Graph partitionStrategy comment · 282cca0e
      CrazyJvm authored
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1368 from CrazyJvm/graph-comment-1 and squashes the following commits:
      
      d47f3c5 [CrazyJvm] fix style
      e190d6f [CrazyJvm] fix Graph partitionStrategy comment
      282cca0e
  2. Jul 10, 2014
• [SPARK-2358][MLLIB] Add an option to include native BLAS/LAPACK loader in the build · 2f59ce7d
      Xiangrui Meng authored
      It would be easy for users to include the netlib-java jniloader in the spark jar, which is LGPL-licensed. We can follow the same approach as ganglia support in Spark, which could be enabled by turning on "-Pganglia-lgpl" at build time. We can use "-Pnetlib-lgpl" flag for this.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1295 from mengxr/netlib-lgpl and squashes the following commits:
      
      aebf001 [Xiangrui Meng] add a profile to optionally include native BLAS/LAPACK loader in mllib
      2f59ce7d
• [SPARK-2428][SQL] Add except and intersect methods to SchemaRDD. · 10b59ba2
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1355 from ueshin/issues/SPARK-2428 and squashes the following commits:
      
      b6fa264 [Takuya UESHIN] Add except and intersect methods to SchemaRDD.
      10b59ba2
• [SPARK-2415] [SQL] RowWriteSupport should handle empty ArrayType correctly. · f5abd271
      Takuya UESHIN authored
`RowWriteSupport` doesn't write empty `ArrayType` values, so the value read back becomes `null`.
It should write empty `ArrayType` values as they are.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1339 from ueshin/issues/SPARK-2415 and squashes the following commits:
      
      32afc87 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2415
      2f05196 [Takuya UESHIN] Fix RowWriteSupport to handle empty ArrayType correctly.
      f5abd271
• [SPARK-2431][SQL] Refine StringComparison and related codes. · f62c4272
      Takuya UESHIN authored
      Refine `StringComparison` and related codes as follows:
      - `StringComparison` could be similar to `StringRegexExpression` or `CaseConversionExpression`.
      - Nullability of `StringRegexExpression` could depend on children's nullabilities.
      - Add a case that the like condition includes no wildcard to `LikeSimplification`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1357 from ueshin/issues/SPARK-2431 and squashes the following commits:
      
      77766f5 [Takuya UESHIN] Add a case that the like condition includes no wildcard to LikeSimplification.
      b9da9d2 [Takuya UESHIN] Fix nullability of StringRegexExpression.
      680bb72 [Takuya UESHIN] Refine StringComparison.
      f62c4272
• SPARK-2427: Fix Scala examples that use the wrong command line arguments index · ae8ca4df
      Artjom-Metro authored
The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This is due to the fix of JIRA 1565, where these examples were not correctly adapted to the new usage of the submit script.
      
      Author: Artjom-Metro <Artjom-Metro@users.noreply.github.com>
      Author: Artjom-Metro <artjom31415@googlemail.com>
      
      Closes #1353 from Artjom-Metro/fix_examples and squashes the following commits:
      
      6111801 [Artjom-Metro] Reduce the default number of iterations
      cfaa73c [Artjom-Metro] Fix some examples that use the wrong index to access the command line arguments
      ae8ca4df
• [SPARK-1341] [Streaming] Throttle BlockGenerator to limit rate of data consumption. · 2dd67248
      Issac Buenrostro authored
      Author: Issac Buenrostro <buenrostro@ooyala.com>
      
      Closes #945 from ibuenros/SPARK-1341-throttle and squashes the following commits:
      
      5514916 [Issac Buenrostro] Formatting changes, added documentation for streaming throttling, stricter unit tests for throttling.
      62f395f [Issac Buenrostro] Add comments and license to streaming RateLimiter.scala
      7066438 [Issac Buenrostro] Moved throttle code to RateLimiter class, smoother pushing when throttling active
      ccafe09 [Issac Buenrostro] Throttle BlockGenerator to limit rate of data consumption.
      2dd67248
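The throttling idea above can be sketched in a few lines of plain Scala. This is not the actual Spark `RateLimiter` class, just an illustrative toy (`SimpleRateLimiter`, `tryAcquire`, and the fixed one-second window are all assumptions for the sketch):

```scala
// Minimal sketch of rate limiting: cap how many records are accepted per
// one-second window; records beyond the cap are rejected (Spark's real
// implementation blocks the producer instead).
final class SimpleRateLimiter(maxPerSecond: Int) {
  private var windowStart = 0L
  private var count = 0

  // Returns true if a record arriving at `nowMillis` fits under the cap.
  def tryAcquire(nowMillis: Long): Boolean = {
    if (nowMillis - windowStart >= 1000L) { windowStart = nowMillis; count = 0 }
    if (count < maxPerSecond) { count += 1; true } else false
  }
}
```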
• [SPARK-1478].3: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 · 40a8fef4
      tmalaska authored
This is a modified version of PR https://github.com/apache/spark/pull/1168 by @tmalaska.
It adds MIMA binary check exclusions.
      
      Author: tmalaska <ted.malaska@cloudera.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #1347 from tdas/FLUME-1915 and squashes the following commits:
      
      96065df [Tathagata Das] Added Mima exclusion for FlumeReceiver.
      41d5338 [tmalaska] Address line 57 that was too long
      12617e5 [tmalaska] SPARK-1478: Upgrade FlumeInputDStream's Flume...
      40a8fef4
• name ec2 instances and security groups consistently · 369aa84e
      Nicholas Chammas authored
Security groups created by `spark-ec2` do not prepend "spark-" to the
name.
      
      Since naming the instances themselves is new to `spark-ec2`, it’s better
      to change that pattern to match the existing naming pattern for the
      security groups, rather than the other way around.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1344 from nchammas/master and squashes the following commits:
      
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      369aa84e
• HOTFIX: Minor doc update for sbt change · 88006a62
      Patrick Wendell authored
      88006a62
• [SPARK-1776] Have Spark's SBT build read dependencies from Maven. · 628932b8
      Prashant Sharma authored
This patch introduces the new way of working while retaining the existing ways of doing things.
      
For example, the build instruction for YARN in Maven is
`mvn -Pyarn -Phadoop-2.2 clean package -DskipTests`
      in sbt it can become
      `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
      Also supports
      `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:
      
      a8ac951 [Prashant Sharma] Updated sbt version.
      62b09bb [Prashant Sharma] Improvements.
      fa6221d [Prashant Sharma] Excluding sql from mima
      4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code review comments.
      acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
      ac4312c [Prashant Sharma] Revert "minor fix"
      6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servlet API jars mess up with the other servlet jars on the class path.
      446768e [Prashant Sharma] minor fix
      89b9777 [Prashant Sharma] Merge conflicts
      d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
      dccc8ac [Prashant Sharma] updated mima to check against 1.0
      a49c61b [Prashant Sharma] Fix for tools jar
      a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
      cf88758 [Prashant Sharma] cleanup
      9439ea3 [Prashant Sharma] Small fix to run-examples script.
      96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
      36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
      4973dbd [Patrick Wendell] Example build using pom reader.
      628932b8
• SPARK-2115: Stage kill link is too close to stage details link · c2babc08
      Masayoshi TSUZUKI authored
Moved the (kill) link to the right side and added a confirmation dialog shown when the (kill) link is clicked.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #1350 from tsudukim/feature/SPARK-2115 and squashes the following commits:
      
      e2263b0 [Masayoshi TSUZUKI] Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.
      c2babc08
• Clean up SparkKMeans example's code · 2b18ea98
      Raymond Liu authored
      remove unused code
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1352 from colorant/kmeans and squashes the following commits:
      
      ddcd1dd [Raymond Liu] Clean up SparkKMeans example's code
      2b18ea98
  3. Jul 09, 2014
• HOTFIX: Remove persistently failing test in master. · 553c578d
      Patrick Wendell authored
Apparently this functionality is going to be removed soon anyway.
      553c578d
• Revert "[HOTFIX] Synchronize on SQLContext.settings in tests." · dd22bc2d
      Patrick Wendell authored
      This reverts commit d4c30cd9.
      dd22bc2d
• SPARK-2416: Allow richer reporting of unit test results · 2e0a037d
      Patrick Wendell authored
      The built-in Jenkins integration is pretty bad. It's very confusing to users whether tests have passed or failed and we can't easily customize the message.
      
      With some small scripting around the Github API we can do much better than this.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1340 from pwendell/better-qa-messages and squashes the following commits:
      
      fd6077d [Patrick Wendell] Better automation for unit tests.
      2e0a037d
• SPARK-1782: svd for sparse matrix using ARPACK · 1f33e1f2
      Li Pu authored
      copy ARPACK dsaupd/dseupd code from latest breeze
      change RowMatrix to use sparse SVD
      change tests for sparse SVD
      
      All tests passed. I will run it against some large matrices.
      
      Author: Li Pu <lpu@twitter.com>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Li Pu <li.pu@outlook.com>
      
      Closes #964 from vrilleup/master and squashes the following commits:
      
      7312ec1 [Li Pu] very minor comment fix
      4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master
      a461082 [Xiangrui Meng] make superscript show up correctly in doc
      861ec48 [Xiangrui Meng] simplify axpy
      62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs
      c273771 [Li Pu] automatically determine SVD compute mode and parameters
      7148426 [Li Pu] improve RowMatrix multiply
      5543cce [Li Pu] improve svd api
      819824b [Li Pu] add flag for dense svd or sparse svd
      eb15100 [Li Pu] fix binary compatibility
      4c7aec3 [Li Pu] improve comments
      e7850ed [Li Pu] use aggregate and axpy
      827411b [Li Pu] fix EOF new line
      9c80515 [Li Pu] use non-sparse implementation when k = n
      fe983b0 [Li Pu] improve scala style
      96d2ecb [Li Pu] improve eigenvalue sorting
      e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK
      1f33e1f2
• [SPARK-2417][MLlib] Fix DecisionTree tests · d35e3db2
      johnnywalleye authored
      Fixes test failures introduced by https://github.com/apache/spark/pull/1316.
      
      For both the regression and classification cases,
      val stats is the InformationGainStats for the best tree split.
      stats.predict is the predicted value for the data, before the split is made.
      Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.
      
      Author: johnnywalleye <jsondag@gmail.com>
      
      Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits:
      
      ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests
      d35e3db2
• [STREAMING] SPARK-2343: Fix QueueInputDStream with oneAtATime false · 0eb11527
      Manuel Laflamme authored
      Fix QueueInputDStream which was not removing dequeued items when used with the oneAtATime flag disabled.
      
      Author: Manuel Laflamme <manuel.laflamme@gmail.com>
      
      Closes #1285 from mlaflamm/spark-2343 and squashes the following commits:
      
      61c9e38 [Manuel Laflamme] Unit tests for queue input stream
      c51d029 [Manuel Laflamme] Fix QueueInputDStream with oneAtATime false
      0eb11527
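The fixed behaviour can be sketched with a plain `mutable.Queue` (this is an illustrative toy, not the actual `QueueInputDStream` code; `nextBatch` is a hypothetical name): with `oneAtATime` disabled, every queued item must actually be dequeued for the batch, not merely read and left in the queue.

```scala
import scala.collection.mutable

// Sketch: one item per batch when oneAtATime is true; otherwise drain the
// whole queue so dequeued items are removed, as the fix above ensures.
def nextBatch[T](queue: mutable.Queue[T], oneAtATime: Boolean): Seq[T] =
  if (oneAtATime) {
    if (queue.isEmpty) Seq.empty else Seq(queue.dequeue())
  } else {
    val buf = Seq.newBuilder[T]
    while (queue.nonEmpty) buf += queue.dequeue()
    buf.result()
  }
```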
• [SPARK-2384] Add tooltips to UI. · 339441f5
      Kay Ousterhout authored
      This patch adds tooltips to clarify some points of confusion in the UI.  When users mouse over some of the table headers (shuffle read, write, and input size) as well as over the "scheduler delay" metric shown for each stage, a black tool tip (see image below) pops up describing the metric in more detail.  After the tooltip mechanism is added by this commit, I imagine others may want to add more tooltips for other things in the UI, but I think this is a good starting point.
      
      ![tooltip](https://cloud.githubusercontent.com/assets/1108612/3491905/994e179e-059f-11e4-92f2-c6c12d248d81.jpg)
      
      This looks scary-big but much of it is adding the bootstrap tool tip JavaScript.
      
      Also I have no idea what to put for the license in tooltip (I left it the same -- the Twitter apache header) or for JQuery (left it as nothing) -- @mateiz what's the right thing here?
      
      cc @pwendell @andrewor14 @rxin
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #1314 from kayousterhout/tooltips and squashes the following commits:
      
      19981b5 [Kay Ousterhout] Exclude non-licensed javascript files from style check
      d9ab5a9 [Kay Ousterhout] Response to Andrew's review
      7752449 [Kay Ousterhout] [SPARK-2384] Add tooltips to UI.
      339441f5
  4. Jul 08, 2014
• [SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) · 1114207c
      johnnywalleye authored
      Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.
      
      In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:
      
rightNodeAgg(featureIndex)(2 * (numBins - 2))
  = binData(shift + (2 * numBins - 1))
      
      Then there is a loop that sets the rest of the values, as in line 809:
      
      rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
        binData(shift + (2 *(numBins - 2 - splitIndex))) +
        rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))
      
      But since splitIndex starts at 1, this ends up skipping a set of binData values.
      
      The changes here address this issue, for both the Regression and Classification cases.
      
      Author: johnnywalleye <jsondag@gmail.com>
      
      Closes #1316 from johnnywalleye/master and squashes the following commits:
      
      73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations
      1114207c
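A simplified, hypothetical illustration of the off-by-one (one feature and one statistic per bin, rather than the real interleaved layout): the right node aggregate for split i should be the sum of bins i+1 through numBins-1, but indexing the wrong bin inside the loop skips a bin's worth of data.

```scala
// Buggy variant: the loop reads bin (n - 2 - splitIndex), so one bin's data
// is never added into the aggregates.
def buggyRightAgg(binData: Array[Double]): Array[Double] = {
  val n = binData.length
  val agg = new Array[Double](n - 1)
  agg(n - 2) = binData(n - 1)
  for (splitIndex <- 1 until n - 1)
    agg(n - 2 - splitIndex) = binData(n - 2 - splitIndex) + agg(n - 1 - splitIndex)
  agg
}

// Fixed variant: each step adds bin (n - 1 - splitIndex), so agg(i) is the
// sum of bins i+1 .. n-1.
def fixedRightAgg(binData: Array[Double]): Array[Double] = {
  val n = binData.length
  val agg = new Array[Double](n - 1)
  agg(n - 2) = binData(n - 1)
  for (splitIndex <- 1 until n - 1)
    agg(n - 2 - splitIndex) = binData(n - 1 - splitIndex) + agg(n - 1 - splitIndex)
  agg
}
```

With bins `[1, 2, 3, 4]` the fixed version yields `[9, 7, 4]`, while the buggy version yields `[7, 6, 4]` because bin 2 (value 3) is skipped.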
• [SPARK-2413] Upgrade junit_xml_listener to 0.5.1 · ac9cdc11
      DB Tsai authored
which fixes the following issues:

1) fix the class name to be the fully qualified classpath
2) make sure the reporting time is in seconds, not milliseconds, which was causing the JUnit HTML report to show incorrect numbers
3) make sure the durations of the tests are cumulative.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #1333 from dbtsai/dbtsai-junit and squashes the following commits:
      
      bbeac4b [DB Tsai] Upgrade junit_xml_listener to 0.5.1 which fixes the following issues
      ac9cdc11
• [SPARK-2392] Executors should not start their own HTTP servers · bf04a390
      Andrew Or authored
      Executors currently start their own unused HTTP file servers. This is because we use the same SparkEnv class for both executors and drivers, and we do not distinguish this case.
      
      In the longer term, we should separate out SparkEnv for the driver and SparkEnv for the executors.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1335 from andrewor14/executor-http-server and squashes the following commits:
      
      46ef263 [Andrew Or] Start HTTP server only on the driver
      bf04a390
• [SPARK-2362] Fix for newFilesOnly logic in file DStream · e6f7bfcf
      Gabriele Nizzoli authored
The newFilesOnly logic should be inverted: if the flag newFilesOnly==true, then only files newer than the current time should be read. As the code is now, if newFilesOnly==true the threshold is 0L, so it reads every file in the directory.
      
      Author: Gabriele Nizzoli <mail@nizzoli.net>
      
      Closes #1077 from gabrielenizzoli/master and squashes the following commits:
      
      4f1d261 [Gabriele Nizzoli] Fix for newFilesOnly logic in file DStream
      e6f7bfcf
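The corrected threshold logic can be sketched in plain Scala (illustrative names like `modTimeThreshold` and `selectFiles`, not the actual DStream internals): with `newFilesOnly == true`, only files modified at or after the stream's start time are selected; with `false`, every file qualifies.

```scala
// Threshold below which files are ignored: the stream's start time when
// only new files are wanted, otherwise 0L (accept everything).
def modTimeThreshold(newFilesOnly: Boolean, startTime: Long): Long =
  if (newFilesOnly) startTime else 0L

// Select the files whose modification time passes the threshold.
def selectFiles(modTimes: Map[String, Long],
                newFilesOnly: Boolean,
                startTime: Long): Set[String] = {
  val threshold = modTimeThreshold(newFilesOnly, startTime)
  modTimes.collect { case (name, t) if t >= threshold => name }.toSet
}
```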
• [SPARK-2409] Make SQLConf thread safe. · 32516f86
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1334 from rxin/sqlConfThreadSafetuy and squashes the following commits:
      
      c1e0a5a [Reynold Xin] Fixed the duplicate comment.
      7614372 [Reynold Xin] [SPARK-2409] Make SQLConf thread safe.
      32516f86
• SPARK-2400 : fix spark.yarn.max.executor.failures explanation · b520b645
      CrazyJvm authored
      According to
      ```scala
        private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
          sparkConf.getInt("spark.yarn.max.worker.failures", math.max(args.numExecutors * 2, 3)))
      ```
the default value should be numExecutors * 2, with a minimum of 3, and the same applies to the config
`spark.yarn.max.worker.failures`
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1282 from CrazyJvm/yarn-doc and squashes the following commits:
      
      1a5f25b [CrazyJvm] remove deprecated config
      c438aec [CrazyJvm] fix style
      86effa6 [CrazyJvm] change expression
      211f130 [CrazyJvm] fix html tag
      2900d23 [CrazyJvm] fix style
      a4b2e27 [CrazyJvm] fix configuration spark.yarn.max.executor.failures
      b520b645
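The default quoted above boils down to a one-liner, shown here as a worked example (the function name is illustrative, not Spark's):

```scala
// Default max executor failures: twice the requested executors, floored at 3.
def defaultMaxExecutorFailures(numExecutors: Int): Int =
  math.max(numExecutors * 2, 3)
```

So a job requesting 1 executor tolerates 3 failures, while one requesting 8 tolerates 16.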
• [SPARK-2403] Catch all errors during serialization in DAGScheduler · c8a2313c
      Daniel Darabos authored
      https://issues.apache.org/jira/browse/SPARK-2403
      
      Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.
      
      I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #1329 from darabos/spark-2403 and squashes the following commits:
      
      3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization.
      52c22ba [Daniel Darabos] Only catch NonFatal exceptions.
      361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.
      c8a2313c
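The pattern behind the fix can be sketched without any Spark code (names like `trySerialize` are assumptions for the sketch, not the DAGScheduler API): attempt serialization, catch only `NonFatal` errors, and surface them as a failure instead of letting the job hang.

```scala
import scala.util.control.NonFatal

// Attempt to serialize a value; a NonFatal error (e.g. a missing Kryo
// registration) is turned into a Left instead of escaping the scheduler.
def trySerialize[T](value: T)(serialize: T => Array[Byte]): Either[String, Array[Byte]] =
  try Right(serialize(value))
  catch { case NonFatal(e) => Left(s"Task not serializable: ${e.getMessage}") }
```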
• [SPARK-2395][SQL] Optimize common LIKE patterns. · cc3e0a14
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1325 from marmbrus/slowLike and squashes the following commits:
      
      023c3eb [Michael Armbrust] add comment.
      8b421c2 [Michael Armbrust] Handle the case where the final % is actually escaped.
      d34d37e [Michael Armbrust] add periods.
      3bbf35f [Michael Armbrust] Roll back changes to SparkBuild
      53894b1 [Michael Armbrust] Fix grammar.
      4094462 [Michael Armbrust] Fix grammar.
      6d3d0a0 [Michael Armbrust] Optimize common LIKE patterns.
      cc3e0a14
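The idea of the optimization can be sketched as follows (an illustrative toy, not Catalyst's `LikeSimplification` rule): LIKE patterns with a single leading or trailing `%` can be answered with cheap string operations instead of compiling a regular expression.

```scala
// Evaluate a SQL LIKE pattern, using fast paths for the common shapes:
// no wildcard -> equality; "abc%" -> startsWith; "%abc" -> endsWith;
// "%abc%" -> contains; anything else falls back to a regex.
def optimizedLike(value: String, pattern: String): Boolean = {
  def plain(s: String) = !s.contains("%") && !s.contains("_")
  pattern match {
    case p if plain(p) => value == p
    case p if p.startsWith("%") && p.endsWith("%") && p.length >= 2 &&
              plain(p.substring(1, p.length - 1)) =>
      value.contains(p.substring(1, p.length - 1))
    case p if p.endsWith("%") && plain(p.dropRight(1)) => value.startsWith(p.dropRight(1))
    case p if p.startsWith("%") && plain(p.drop(1))    => value.endsWith(p.drop(1))
    case p => value.matches(p.replace("%", ".*").replace("_", "."))
  }
}
```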
• [EC2] Add default history server port to ec2 script · 56e009d4
      Andrew Or authored
      Right now I have to open it manually
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1296 from andrewor14/hist-serv-port and squashes the following commits:
      
      8895a1f [Andrew Or] Add default history server port to ec2 script
      56e009d4
• [SPARK-2391][SQL] Custom take() for LIMIT queries. · 5a406364
      Michael Armbrust authored
Using Spark's take can result in an entire in-memory partition being shipped in order to retrieve a single row.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1318 from marmbrus/takeLimit and squashes the following commits:
      
      77289a5 [Michael Armbrust] Update scala doc
      32f0674 [Michael Armbrust] Custom take implementation for LIMIT queries.
      5a406364
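The incremental idea can be sketched on plain collections (partitions modeled as `Seq[Seq[T]]`; `takeLimit` is a hypothetical name, not the SchemaRDD API): pull rows partition by partition and stop as soon as the limit is reached, rather than materializing whole partitions.

```scala
// Take up to `limit` rows, consuming partitions lazily from the front and
// stopping early once enough rows have been collected.
def takeLimit[T](partitions: Seq[Seq[T]], limit: Int): Seq[T] = {
  val buf = Seq.newBuilder[T]
  var remaining = limit
  val it = partitions.iterator
  while (remaining > 0 && it.hasNext) {
    val chunk = it.next().take(remaining)
    buf ++= chunk
    remaining -= chunk.length
  }
  buf.result()
}
```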
• Resolve sbt warnings during build Ⅱ · 3cd5029b
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #1153 from witgo/expectResult and squashes the following commits:
      
      97541d8 [witgo] merge master
      ead26e7 [witgo] Resolve sbt warnings during build
      3cd5029b
• Updated programming-guide.md · 0128905e
      Rishi Verma authored
Made sure that readers know the random number generator seed argument of the 'takeSample' method is optional.
      
      Author: Rishi Verma <riverma@apache.org>
      
      Closes #1324 from riverma/patch-1 and squashes the following commits:
      
      4699676 [Rishi Verma] Updated programming-guide.md
      0128905e
  5. Jul 07, 2014
• [SPARK-2235][SQL] Spark SQL basicOperator add Intersect operator · 50561f43
      Yanjie Gao authored
Hi all,
I want to submit a basic operator, Intersect.
For example, in SQL:
select * from table1
intersect
select * from table2
I want this operator to support this function in Spark SQL.
This operator returns the intersection of the SparkPlan child table RDDs.
JIRA: https://issues.apache.org/jira/browse/SPARK-2235
      
      Author: Yanjie Gao <gaoyanjie55@163.com>
      Author: YanjieGao <396154235@qq.com>
      
      Closes #1150 from YanjieGao/patch-5 and squashes the following commits:
      
      4629afe [YanjieGao] reformat the code
      bdc2ac0 [YanjieGao] reformat the code as Michael's suggestion
      3b29ad6 [YanjieGao] Merge remote branch 'upstream/master' into patch-5
      1cfbfe6 [YanjieGao] refomat some files
      ea78f33 [YanjieGao] resolve conflict and add annotation on basicOperator and remove HiveQl
      0c7cca5 [YanjieGao] modify format problem
      a802ca8 [YanjieGao] Merge remote branch 'upstream/master' into patch-5
      5e374c7 [YanjieGao] resolve conflict in SparkStrategies and basicOperator
      f7961f6 [Yanjie Gao] update the line less than
      bdc4a05 [Yanjie Gao] Update basicOperators.scala
      0b49837 [Yanjie Gao] delete the annotation
      f1288b4 [Yanjie Gao] delete annotation
      e2b64be [Yanjie Gao] Update basicOperators.scala
      4dd453e [Yanjie Gao] Update SQLQuerySuite.scala
      790765d [Yanjie Gao] Update SparkStrategies.scala
      ac73e60 [Yanjie Gao] Update basicOperators.scala
      d4ac5e5 [Yanjie Gao] Update HiveQl.scala
      61e88e7 [Yanjie Gao] Update SqlParser.scala
      469f099 [Yanjie Gao] Update basicOperators.scala
      e5bff61 [Yanjie Gao] Spark SQL basicOperator add Intersect operator
      50561f43
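The operator's semantics can be sketched on plain collections (an illustrative toy, not the actual `basicOperators.scala` code): SQL INTERSECT is set-based, so it keeps the rows that appear in both children, with duplicates removed.

```scala
// Rows present in both inputs, first-occurrence order, no duplicates.
def intersectRows[T](left: Seq[T], right: Seq[T]): Seq[T] =
  left.distinct.filter(right.toSet)
```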
• [SPARK-2376][SQL] Selecting list values inside nested JSON objects raises... · 4352a2fd
      Yin Huai authored
      [SPARK-2376][SQL] Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2376
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1320 from yhuai/SPARK-2376 and squashes the following commits:
      
      0107417 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2376
      480803d [Yin Huai] Correctly handling JSON arrays in PySpark.
      4352a2fd
• [SPARK-2375][SQL] JSON schema inference may not resolve type conflicts... · f0496ee1
      Yin Huai authored
      [SPARK-2375][SQL] JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs
      
      For example, for
      ```
      {"array": [{"field":214748364700}, {"field":1}]}
      ```
the type of field is resolved as IntType, while for
      ```
      {"array": [{"field":1}, {"field":214748364700}]}
      ```
      the type of field is resolved as LongType.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2375
      
      Author: Yin Huai <huaiyin.thu@gmail.com>
      
      Closes #1308 from yhuai/SPARK-2375 and squashes the following commits:
      
      3e2e312 [Yin Huai] Update unit test.
      1b2ff9f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2375
      10794eb [Yin Huai] Correctly resolve the type of a field inside an array of structs.
      f0496ee1
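The required behaviour can be sketched with a two-type toy model (illustrative names, not the actual Spark SQL data types): conflict resolution must commute, so the inferred type is the same whichever record is seen first.

```scala
// Toy type lattice: Int widens to Long on conflict.
sealed trait JsonType
case object IntType extends JsonType
case object LongType extends JsonType

// Order-independent conflict resolution: Long wins over Int.
def widen(a: JsonType, b: JsonType): JsonType =
  if (a == LongType || b == LongType) LongType else IntType

// Infer the type of a field from all the values observed for it.
def inferFieldType(values: Seq[Long]): JsonType =
  values
    .map(v => if (v >= Int.MinValue.toLong && v <= Int.MaxValue.toLong) IntType else LongType)
    .reduce(widen)
```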
• [SPARK-2386] [SQL] RowWriteSupport should use the exact types to cast. · 4deeed17
      Takuya UESHIN authored
When executing `saveAsParquetFile` with a non-primitive type, `RowWriteSupport` uses the wrong type `Int` for `ByteType` and `ShortType`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1315 from ueshin/issues/SPARK-2386 and squashes the following commits:
      
      20d89ec [Takuya UESHIN] Use None instead of null.
      bd88741 [Takuya UESHIN] Add a test.
      323d1d2 [Takuya UESHIN] Modify RowWriteSupport to use the exact types to cast.
      4deeed17
• [SPARK-2339][SQL] SQL parser in sql-core is case sensitive, but a table alias... · c0b4cf09
      Yin Huai authored
      [SPARK-2339][SQL] SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
      
      Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
After we get the table from the catalog, because the table has an alias, we temporarily insert a Subquery. Then, we convert the table alias to lower case regardless of whether the parser is case sensitive.
      To see the issue ...
      ```
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext.createSchemaRDD
      
      case class Person(name: String, age: Int)
      
      val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
      people.registerAsTable("people")
      
      sqlContext.sql("select PEOPLE.name from people PEOPLE")
      ```
      The plan is ...
      ```
      == Query Plan ==
      Project ['PEOPLE.name]
       ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176
      ```
      You can find that `PEOPLE.name` is not resolved.
      
      This PR introduces three changes.
1.  If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work.
2.  A catalog has a new val caseSensitive that indicates whether the catalog is case sensitive. For example, a SimpleCatalog is case sensitive.
3.  Corresponding unit tests.
With this PR, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers is handled by the analyzer.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2339
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1317 from yhuai/SPARK-2339 and squashes the following commits:
      
      12d8006 [Yin Huai] Handling case sensitivity correctly. This patch introduces three changes. 1. If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work. 2. A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but 3. Corresponding unit tests. With this patch, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers is handled by the analyzer.
      c0b4cf09
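The division of labour described above can be sketched in two tiny functions (illustrative names, not the actual catalog or analyzer API): the catalog stores the alias verbatim, and only the analyzer lowercases it, and only when the dialect is case insensitive.

```scala
// Catalog: keep the alias exactly as the user wrote it.
def catalogAlias(alias: String): String = alias

// Analyzer: lowercase only when the dialect is case insensitive.
def analyzerAlias(alias: String, caseSensitive: Boolean): String =
  if (caseSensitive) alias else alias.toLowerCase
```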
• [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS · f7ce1b3b
      Neville Li authored
      Author: Neville Li <neville@spotify.com>
      
      Closes #1319 from nevillelyh/gh/SPARK-1977 and squashes the following commits:
      
      1f0a355 [Neville Li] [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS
      f7ce1b3b
  6. Jul 05, 2014
• [SPARK-2327] [SQL] Fix nullabilities of Join/Generate/Aggregate. · 9d5ecf82
      Takuya UESHIN authored
      Fix nullabilities of `Join`/`Generate`/`Aggregate` because:
      - Output attributes of opposite side of `OuterJoin` should be nullable.
- Output attributes of the generator side of `Generate` should be nullable if `join` is `true` and `outer` is `true`.
      - `AttributeReference` of `computedAggregates` of `Aggregate` should be the same as `aggregateExpression`'s.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1266 from ueshin/issues/SPARK-2327 and squashes the following commits:
      
      3ace83a [Takuya UESHIN] Add withNullability to Attribute and use it to change nullabilities.
      df1ae53 [Takuya UESHIN] Modify nullabilize to leave attribute if not resolved.
      799ce56 [Takuya UESHIN] Add nullabilization to Generate of SparkPlan.
      a0fc9bc [Takuya UESHIN] Fix scalastyle errors.
      0e31e37 [Takuya UESHIN] Fix Aggregate resultAttribute nullabilities.
      09532ec [Takuya UESHIN] Fix Generate output nullabilities.
      f20f196 [Takuya UESHIN] Fix Join output nullabilities.
      9d5ecf82