- Sep 02, 2014
-
-
Patrick Wendell authored
During regression tests of Spark 1.1 we discovered perf issues with PVM instances when running PySpark. This reverts a change added in #1156 which changed the default type for m3 instances to PVM. Author: Patrick Wendell <pwendell@gmail.com> Closes #2244 from pwendell/ec2-hvm and squashes the following commits: 1342d7e [Patrick Wendell] SPARK-3358: [EC2] Switch back to HVM instances for m3.X.
-
Liang-Chi Hsieh authored
The function `ensureFreeSpace` in object `ColumnBuilder` clears old buffer before copying its content to new buffer. This PR fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #2195 from viirya/fix_buffer_clear and squashes the following commits: 792f009 [Liang-Chi Hsieh] no need to call clear(). use flip() instead of calling limit(), position() and rewind(). df2169f [Liang-Chi Hsieh] should clean old buffer after copying its content.
-
Cheng Lian authored
Class names of these two are just too similar. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2189 from liancheng/column-metrics and squashes the following commits: 8bb3b21 [Cheng Lian] Renamed ColumnStat to ColumnMetrics to avoid confusion between ColumnStats
-
Takuya UESHIN authored
Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2233 from ueshin/issues/SPARK-3341 and squashes the following commits: e497320 [Takuya UESHIN] Fix data type of Sqrt expression.
-
luluorta authored
If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition number, GraphX jobs will throw: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions Author: luluorta <luluorta@gmail.com> Closes #1763 from luluorta/fix-graph-zip and squashes the following commits: 8338961 [luluorta] fix GraphX EdgeRDD zipPartitions
-
Tathagata Das authored
- Include kinesis in the unidocs - Hide non-public classes from docs Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #2239 from tdas/kinesis-doc-fix and squashes the following commits: 156e20c [Tathagata Das] More fixes, based on PR comments. e9a6c01 [Tathagata Das] Fixed docs related to kinesis
-
Larry Xiao authored
minor fix detail is here: https://issues.apache.org/jira/browse/SPARK-2981 Author: Larry Xiao <xiaodi@sjtu.edu.cn> Closes #1902 from larryxiao/2981 and squashes the following commits: 88059a2 [Larry Xiao] [SPARK-2981][GraphX] EdgePartition1D Int overflow
-
uncleGen authored
[SPARK-3123][GraphX]: override the "setName" function to set EdgeRDD's name manually just as VertexRDD does. Author: uncleGen <hustyugm@gmail.com> Closes #2033 from uncleGen/master_origin and squashes the following commits: 801994b [uncleGen] Update EdgeRDD.scala
-
Larry Xiao authored
to support ~/spark/bin/run-example GraphXAnalytics triangles /soc-LiveJournal1.txt --numEPart=256 Author: Larry Xiao <xiaodi@sjtu.edu.cn> Closes #1766 from larryxiao/1986 and squashes the following commits: bb77cd9 [Larry Xiao] [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples
-
Prudhvi Krishna authored
Directory path for dependencies jar and resources in Tachyon 0.5.0 has been changed. Author: Prudhvi Krishna <prudhvi953@gmail.com> Closes #2228 from prudhvije/SPARK-3328/make-dist-fix and squashes the following commits: d1d2c22 [Prudhvi Krishna] SPARK-3328 fixed make-distribution script --with-tachyon option.
-
Davies Liu authored
RDD.countApproxDistinct(relativeSD=0.05): :: Experimental :: Return approximate number of distinct elements in the RDD. The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>. This support all the types of objects, which is supported by Pyrolite, nearly all builtin types. param relativeSD Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017. >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct() >>> 950 < n < 1050 True >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct() >>> 18 < n < 22 True Author: Davies Liu <davies.liu@gmail.com> Closes #2142 from davies/countApproxDistinct and squashes the following commits: e20da47 [Davies Liu] remove the correction in Python c38c4e4 [Davies Liu] fix doc tests 2ab157c [Davies Liu] fix doc tests 9d2565f [Davies Liu] add commments and link for hash collision correction d306492 [Davies Liu] change range of hash of tuple to [0, maxint] ded624f [Davies Liu] calculate hash in Python 4cba98f [Davies Liu] add more tests a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct e97e342 [Davies Liu] add countApproxDistinct()
-
Sandy Ryza authored
...job fails while reading from Hadoop Author: Sandy Ryza <sandy@cloudera.com> Closes #1956 from sryza/sandy-spark-3052 and squashes the following commits: 815813a [Sandy Ryza] SPARK-3052. Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop
-
Marcelo Vanzin authored
Missing import. Oops. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2236 from vanzin/SPARK-3347 and squashes the following commits: 594fc39 [Marcelo Vanzin] [SPARK-3347] [yarn] Fix yarn-alpha compilation.
-
Andrew Or authored
We were trying to add `file:/C:/path/to/my.jar` to the class path. We should add `C:/path/to/my.jar` instead. Tested on Windows 8.1. Author: Andrew Or <andrewor14@gmail.com> Closes #2211 from andrewor14/windows-shell-jars and squashes the following commits: 262c6a2 [Andrew Or] Oops... Add the new code to the correct place 0d5a0c1 [Andrew Or] Format jar path only for adding to shell classpath 42bd626 [Andrew Or] Remove unnecessary code 0049f1b [Andrew Or] Remove embarrassing log messages b1755a0 [Andrew Or] Format jar paths properly before adding them to the classpath
-
Josh Rosen authored
The Maven build was failing on Windows because it tried to call the unix `unzip` utility to extract the Py4J files into core's build directory. I've fixed this issue by using the `maven-antrun-plugin` to perform the unzipping. I also fixed an issue that prevented tests from running under Windows: In the Maven ScalaTest plugin, the filename listed in <filereports> is placed under the <reportsDirectory>; the current code places it in a subdirectory of reportsDirectory, e.g. ``` ${project.build.directory}/surefire-reports/${project.build.directory}/SparkTestSuite.txt ``` This caused problems under Windows because it would try to create a subdirectory named "c:\\". Note that the tests still fail under Windows (for other reasons); this PR just allows them to run and fail rather than crash when trying to create the test reports directory. Author: Josh Rosen <joshrosen@apache.org> Author: Josh Rosen <rosenville@gmail.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #2165 from JoshRosen/windows-support and squashes the following commits: 651d210 [Josh Rosen] Unzip to python/build instead of core/build fbf3e61 [Josh Rosen] 4 spaces -> 2 spaces e347668 [Josh Rosen] Fix Maven scalatest filereports path: 4994af1 [Josh Rosen] [SPARK-3061] Use maven-antrun-plugin to unzip Py4J.
-
Sean Owen authored
PEP8 tests run on files under "./python", but unzipped py4j code is found at "./python/build/py4j". Py4J code fails style checks and can fail ./dev/run-tests if this code is present locally. Author: Sean Owen <sowen@cloudera.com> Closes #2222 from srowen/SPARK-3331 and squashes the following commits: 34711ec [Sean Owen] Restrict lint check to pyspark/, since the local directory can contain unzipped py4j code in build/py4j
-
Reza Zadeh authored
Kill this bug fast before it does damage. Author: Reza Zadeh <rizlar@gmail.com> Closes #2224 from rezazadeh/indexrmbug and squashes the following commits: 53386d6 [Reza Zadeh] Squash bug in IndexedRowMatrix
-
lirui authored
This PR adds the async actions to the Java API. User can call these async actions to get the FutureAction and use JobWaiter (for SimpleFutureAction) to retrieve job Id. Author: lirui <rui.li@intel.com> Closes #2176 from lirui-intel/SPARK-2636 and squashes the following commits: ccaafb7 [lirui] SPARK-2636: fix java doc 5536d55 [lirui] SPARK-2636: mark the async API as experimental e2e01d5 [lirui] SPARK-2636: add mima exclude 0ca320d [lirui] SPARK-2636: fix method name & javadoc 3fa39f7 [lirui] SPARK-2636: refine the patch af4f5d9 [lirui] SPARK-2636: remove unused imports 843276c [lirui] SPARK-2636: only keep foreachAsync in the java API fbf5744 [lirui] SPARK-2636: add more async actions for java api 1b25abc [lirui] SPARK-2636: expose some fields in JobWaiter d09f732 [lirui] SPARK-2636: fix build eb1ee79 [lirui] SPARK-2636: change some parameters in SimpleFutureAction to member field 6e2b87b [lirui] SPARK-2636: add java API for async actions
-
Daniel Darabos authored
On `m3.2xlarge` instances the 2x80GB SSDs are inaccessible if not added to the block device mapping when the instance is created. They work when added with this patch. I have not tested this with other instance types, and I do not know much about this script and EC2 deployment in general. Maybe this code needs to depend on the instance type. The requirement for this mapping is described in the AWS docs at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#InstanceStore_UsageScenarios "For M3 instances, you must specify instance store volumes in the block device mapping for the instance. When you launch an M3 instance, we ignore any instance store volumes specified in the block device mapping for the AMI." Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #2081 from darabos/patch-1 and squashes the following commits: 1ceb2c8 [Daniel Darabos] Use %d string interpolation instead of {}. a1854d7 [Daniel Darabos] Only specify ephemeral device mapping for M3. e0d9e37 [Daniel Darabos] Create ephemeral device mapping based on get_num_disks(). 6b116a6 [Daniel Darabos] Add SSDs to block device mapping
-
- Sep 01, 2014
-
-
Reynold Xin authored
This also enables supporting broadcast variables larger than 2G. Author: Reynold Xin <rxin@apache.org> Closes #2054 from rxin/ByteArrayChunkOutputStream and squashes the following commits: 618d9c8 [Reynold Xin] Code review. 93f5a51 [Reynold Xin] Added comments. ee88e73 [Reynold Xin] to -> until bbd1cb1 [Reynold Xin] Renamed a variable. 36f4d01 [Reynold Xin] Sort imports. 8f1a8eb [Reynold Xin] [SPARK-3135] Created ByteArrayChunkOutputStream and used it to avoid memory copy in TorrentBroadcast.
-
Patrick Wendell authored
This commit exists to close the following pull requests on Github: Closes #1696 (close requested by 'pwendell') Closes #1384 (close requested by 'pwendell') Closes #845 (close requested by 'pwendell') Closes #81 (close requested by 'pwendell') Closes #1528 (close requested by 'pwendell') Closes #1018 (close requested by 'pwendell')
-
- Aug 31, 2014
-
-
scwf authored
https://issues.apache.org/jira/browse/SPARK-3010 this pr is to fix redundant conditional in spark, such as 1. private[spark] def codegenEnabled: Boolean = if (getConf(CODEGEN_ENABLED, "false") == "true") true else false 2. x => if (x == 2) true else false ... Author: scwf <wangfei1@huawei.com> Author: wangfei <wangfei_hello@126.com> Closes #1992 from scwf/condition and squashes the following commits: b2a044a [scwf] merge SecurityManager e16239c [scwf] fix confilct 6811401 [scwf] fix merge confilct 0824df4 [scwf] Merge branch 'master' of https://github.com/apache/spark into patch-4 e274515 [scwf] fix redundant conditions d032bf9 [wangfei] [SQL]Excess judgment
-
- Aug 30, 2014
-
-
Nicholas Chammas authored
Look only at code files (`.py`, `.java`, and `.scala`) for new classes. Should get rid of false alarms like [the one reported here](https://github.com/apache/spark/pull/2014#issuecomment-52912040). Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2184 from nchammas/jenkins-ignore-noncode and squashes the following commits: 33786ac [Nicholas Chammas] break up long line 3f91a14 [Nicholas Chammas] rename array of source files 8b82a26 [Nicholas Chammas] [Spark QA] only check code files for new classes
-
Patrick Wendell authored
This commit exists to close the following pull requests on Github: Closes #1922 (close requested by 'JoshRosen') Closes #1356 (close requested by 'pwendell') Closes #1698 (close requested by 'mengxr') Closes #254 (close requested by 'mateiz') Closes #2135 (close requested by 'andrewor14')
-
Holden Karau authored
Rather than specifying the path to SparkFiles we need to use the filename. Author: Holden Karau <holden@pigscanfly.ca> Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits: a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well 0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path
-
Marcelo Vanzin authored
Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config. This change modifies most places to use SparkHadoopUtil to initialize configs, and make that method do the translation that previously was only done inside SparkContext. The places that were not changed fall in one of the following categories: - Test code where this doesn't really matter - Places deep in the code where plumbing SparkConf would be too difficult for very little gain - Default values for arguments - since the caller can provide their own config in that case Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #1843 from vanzin/SPARK-2889 and squashes the following commits: 52daf35 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 f179013 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 51e71cf [Marcelo Vanzin] Add test to ensure that overriding Yarn configs works. 53f9506 [Marcelo Vanzin] Add DeveloperApi annotation. 3d345cb [Marcelo Vanzin] Restore old method for backwards compat. fc45067 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 0ac3fdf [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 3f26760 [Marcelo Vanzin] Compilation fix. f16cadd [Marcelo Vanzin] Initialize config in SparkHadoopUtil. b8ab173 [Marcelo Vanzin] Update Utils API to take a Configuration argument. 1e7003f [Marcelo Vanzin] Replace explicit Configuration instantiation with SparkHadoopUtil.
-
Reynold Xin authored
Closes #1824
-
Raymond Liu authored
By Hiding the shuffleblockmanager behind Shufflemanager, we decouple the shuffle data's block mapping management work from Diskblockmananger. This give a more clear interface and more easy for other shuffle manager to implement their own block management logic. the jira ticket have more details. Author: Raymond Liu <raymond.liu@intel.com> Closes #1241 from colorant/shuffle and squashes the following commits: 0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager
-
Kousuke Saruta authored
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2200 from sarutak/SPARK-3305 and squashes the following commits: 3cbd6ee [Kousuke Saruta] Removed unused import from classes related to UI
-
Patrick Wendell authored
-
- Aug 29, 2014
-
-
Cheng Lian authored
[SPARK-3320][SQL] Made batched in-memory column buffer building work for SchemaRDDs with empty partitions Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2213 from liancheng/spark-3320 and squashes the following commits: 45a0139 [Cheng Lian] Fixed typo in InMemoryColumnarQuerySuite f67067d [Cheng Lian] Fixed SPARK-3320
-
wangfei authored
[SPARK-3296][mllib] spark-example should be run-example in head notation of DenseKMeans and SparseNaiveBayes `./bin/spark-example` should be `./bin/run-example` in DenseKMeans and SparseNaiveBayes Author: wangfei <wangfei_hello@126.com> Closes #2193 from scwf/run-example and squashes the following commits: 207eb3a [wangfei] spark-example should be run-example 27a8999 [wangfei] ./bin/spark-example should be ./bin/run-example
-
Zdenek Farana authored
If you have a table with TIMESTAMP column, that column can't be used in WHERE clause properly - it is not evaluated properly. [More](https://issues.apache.org/jira/browse/SPARK-3173) Motivation: http://www.aproint.com/aggregation-with-spark-sql/ - [x] modify SqlParser so it supports casting to TIMESTAMP (workaround for item 2) - [x] the string literal should be converted into Timestamp if the column is Timestamp. Author: Zdenek Farana <zdenek.farana@gmail.com> Author: Zdenek Farana <zdenek.farana@aproint.com> Closes #2084 from byF/SPARK-3173 and squashes the following commits: 442b59d [Zdenek Farana] Fixed test merge conflict 2dbf4f6 [Zdenek Farana] Merge remote-tracking branch 'origin/SPARK-3173' into SPARK-3173 65b6215 [Zdenek Farana] Fixed timezone sensitivity in the test 47b27b4 [Zdenek Farana] Now works in the case of "StringLiteral=TimestampColumn" 96a661b [Zdenek Farana] Code style change 491dfcf [Zdenek Farana] Added test cases for SPARK-3173 4446b1e [Zdenek Farana] A string literal is casted into Timestamp when the column is Timestamp. 59af397 [Zdenek Farana] Added a new TIMESTAMP keyword; CAST to TIMESTAMP now can be used in SQL expression.
-
qiping.lqp authored
":" is not allowed to appear in a file name of Windows system. If file name contains ":", this file can't be checked out in a Windows system and developers using Windows must be careful to not commit the deletion of such files, Which is very inconvenient. Author: qiping.lqp <qiping.lqp@alibaba-inc.com> Closes #2191 from chouqin/querytest and squashes the following commits: 0e943a1 [qiping.lqp] rename golden file 60a863f [qiping.lqp] TestcaseName in createQueryTest should not contain ":"
-
Cheng Lian authored
When a large batch size is specified, `SparkSQLOperationManager` OOMs even if the whole result set is much smaller than the batch size. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2171 from liancheng/jdbc-fetch-size and squashes the following commits: 5e1623b [Cheng Lian] Decreases initial buffer size for row set to prevent OOM
-
Cheng Lian authored
`HiveCompatibilitySuite` already turns on in-memory columnar caching, it would be good to also enable compression to improve test coverage. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2190 from liancheng/compression-on and squashes the following commits: 88b536c [Cheng Lian] Code cleanup, narrowed field visibility d13efd2 [Cheng Lian] Turns on in-memory columnar compression in HiveCompatibilitySuite
-
Cheng Hao authored
Thus id property of the TreeNode API does save time in a faster way to compare 2 TreeNodes, it is kind of performance bottleneck during the expression object creation in a multi-threading env (because of the memory barrier). Fortunately, the tree node comparison only happen once in master, so even we remove it, the entire performance will not be affected. Author: Cheng Hao <hao.cheng@intel.com> Closes #2155 from chenghao-intel/treenode and squashes the following commits: 7cf2cd2 [Cheng Hao] Remove the implicit keyword for TreeNodeRef and some other small issues 5873415 [Cheng Hao] Remove the TreeNode.id
-
Cheng Lian authored
[SPARK-3234][Build] Fixed environment variables that rely on deprecated command line options in make-distribution.sh Please refer to [SPARK-3234](https://issues.apache.org/jira/browse/SPARK-3234) for details. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2208 from liancheng/spark-3234 and squashes the following commits: fb26de8 [Cheng Lian] Fixed SPARK-3234
-
William Benton authored
This PR adds a native implementation for SQL SQRT() and thus avoids delegating this function to Hive. Author: William Benton <willb@redhat.com> Closes #1750 from willb/spark-2813 and squashes the following commits: 22c8a79 [William Benton] Fixed missed newline from rebase d673861 [William Benton] Added string coercions for SQRT and associated test case e125df4 [William Benton] Added ExpressionEvaluationSuite test cases for SQRT 7b84bcd [William Benton] SQL SQRT now properly returns NULL for NULL inputs 8256971 [William Benton] added SQRT test to SqlQuerySuite 504d2e5 [William Benton] Added native SQRT implementation
-
Nicholas Chammas authored
As [reported on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8131.html): * Code fencing with triple-backticks doesn’t seem to work like it does on GitHub. Newlines are lost. Instead, use 4-space indent to format small code blocks. * Nested bullets need 2 leading spaces, not 1. * Spellcheck! Author: Nicholas Chammas <nicholas.chammas@gmail.com> Author: nchammas <nicholas.chammas@gmail.com> Closes #2201 from nchammas/sql-doc-fixes and squashes the following commits: 873f889 [Nicholas Chammas] [Docs] fix skip-api flag 5195e0c [Nicholas Chammas] [Docs] SQL doc formatting and typo fixes 3b26c8d [nchammas] [Spark QA] Link to console output on test time out
-