  1. Aug 26, 2014
    • [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() · f1e71d4c
      Davies Liu authored
      Use external sort to support sorting large datasets in the reduce stage.
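      Purely as illustration (not part of this commit), a minimal PySpark sketch of the kind of job the change affects. The memory limit name `spark.python.worker.memory` is an assumption here; the commit itself only says that large sorts in the reduce stage can now spill to disk.
      
      ```
      # Hedged sketch: sortByKey() on data that may exceed a single Python
      # worker's memory; with external sort, the reduce-stage sort spills to
      # disk instead of failing once the (assumed) memory limit is exceeded.
      from pyspark import SparkConf, SparkContext
      
      conf = SparkConf().set("spark.python.worker.memory", "512m")  # assumed knob
      sc = SparkContext(appName="external-sort-sketch", conf=conf)
      
      pairs = sc.parallelize(range(10 ** 6)).map(lambda x: (x % 1000, x))
      print(pairs.sortByKey(ascending=True, numPartitions=10).take(5))
      sc.stop()
      ```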
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1978 from davies/sort and squashes the following commits:
      
      bbcd9ba [Davies Liu] check spilled bytes in tests
      b125d2f [Davies Liu] add test for external sort in rdd
      eae0176 [Davies Liu] choose different disks from different processes and instances
      1f075ed [Davies Liu] Merge branch 'master' into sort
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
      f1e71d4c
    • [SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences · c4787a36
      Michael Armbrust authored
      It is common to want to describe sets of attributes that are in various parts of a query plan.  However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically.  For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.
      
      In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality.  (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32))  This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.
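      For illustration only, a minimal Python sketch of the idea (the real `AttributeSet` is Scala code in Catalyst; the names below merely mirror the concept): membership is decided by a globally unique expression id rather than by name-based case class equality, so cosmetically different references to the same attribute compare equal.
      
      ```
      # Hedged sketch of the AttributeSet idea, not Spark's implementation.
      class Attribute(object):
          def __init__(self, name, expr_id):
              self.name = name        # may differ in case under case-insensitive resolution
              self.expr_id = expr_id  # globally unique id
      
      class AttributeSet(object):
          def __init__(self, attrs=()):
              self._by_id = dict((a.expr_id, a) for a in attrs)
      
          def __contains__(self, attr):
              return attr.expr_id in self._by_id  # compare by id, never by name
      
          def __iter__(self):
              return iter(self._by_id.values())
      
      # "Key" and "KEY" refer to the same underlying attribute (same expr_id),
      # so membership agrees even though the names differ cosmetically.
      a1, a2 = Attribute("Key", 42), Attribute("KEY", 42)
      assert a2 in AttributeSet([a1])
      ```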
      
      I also took this opportunity to refactor the expression and query plan base classes.  In all but one instance the logic for computing the `references` of an `Expression` was the same.  Thus, I moved this logic into the base class.
      
      For query plans, the semantics of the `references` method were ill-defined (does it mean the references of the output? those used during expression evaluation? something else?).  As a result, this method wasn't really used very much, so I removed it.
      
      TODO:
       - [x] Finish scala doc for `AttributeSet`
       - [x] Scan the code for other instances of `Set[Attribute]` and refactor them.
       - [x] Finish removing `references` from `QueryPlan`
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2109 from marmbrus/attributeSets and squashes the following commits:
      
      1c0dae5 [Michael Armbrust] work on serialization bug.
      9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets
      3ae5288 [Michael Armbrust] review comments
      40ce7f6 [Michael Armbrust] style
      d577cc7 [Michael Armbrust] Scaladoc
      cae5d22 [Michael Armbrust] remove more references implementations
      d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes.
      fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
      c4787a36
    • [SPARK-2839][MLlib] Stats Toolkit documentation updated · 1208f72a
      Burak authored
      Documentation updated for the Statistics Toolkit of MLlib. mengxr atalwalkar
      
      https://issues.apache.org/jira/browse/SPARK-2839
      
      P.S. Accidentally closed #2123. New commits didn't show up after I reopened the PR. I've opened this instead and closed the old one.
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #2130 from brkyvz/StatsLib-Docs and squashes the following commits:
      
      a54a855 [Burak] [SPARK-2839][MLlib] Addressed comments
      bfc6896 [Burak] [SPARK-2839][MLlib] Added a more specific link to colStats() for pyspark
      213fe3f [Burak] [SPARK-2839][MLlib] Modifications made according to review
      fec4d9d [Burak] [SPARK-2830][MLlib] Stats Toolkit documentation updated
      1208f72a
    • [SPARK-3226][MLLIB] doc update for native libraries · adbd5c16
      Xiangrui Meng authored
      Updated the MLlib docs to mention the `-Pnetlib-lgpl` option. atalwalkar
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2128 from mengxr/mllib-native and squashes the following commits:
      
      4cbba57 [Xiangrui Meng] update mllib dependencies
      adbd5c16
    • [SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map. · 6b5584ef
      Takuya UESHIN authored
      Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` values.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1963 from ueshin/issues/SPARK-3063 and squashes the following commits:
      
      3ba41f2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
      4d7bae2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
      9321379 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
      d8a900a [Takuya UESHIN] Make ExistingRdd.convertToCatalyst be able to convert Map value.
      6b5584ef
    • [SPARK-2969][SQL] Make ScalaReflection be able to handle... · 98c2bb0b
      Takuya UESHIN authored
      [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
      
      Make `ScalaReflection` able to handle types like:
      
      - `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
      - `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
      - `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
      - `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:
      
      24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
      79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
      7cd1a7a [Takuya UESHIN] Fix json test failures.
      2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
      2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
      9fa02f5 [Takuya UESHIN] Fix a test failure.
      1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
      98c2bb0b
    • [SPARK-2871] [PySpark] add histogram() API · 3cedc4f4
      Davies Liu authored
      RDD.histogram(buckets)
      
              Compute a histogram using the provided buckets. The buckets
              are all open to the right except for the last, which is closed.
              e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
              i.e. 1<=x<10, 10<=x<20, 20<=x<=50. On an input of 1
              and 50 we would have a histogram of [1, 0, 1].
      
              If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
              this can be switched from an O(log n) insertion to O(1) per
              element (where n = # of buckets).
      
              Buckets must be sorted, must not contain any duplicates, and
              must have at least two elements.
      
              If `buckets` is a number, it generates buckets that are evenly
              spaced between the minimum and maximum of the RDD. For
              example, if the min value is 0 and the max is 100, given
              `buckets` as 2, the resulting buckets will be [0,50) [50,100].
              `buckets` must be at least 1. If the RDD contains infinity or
              NaN, an exception is thrown. If the elements in the RDD do not
              vary (max == min), a single bucket is always returned.
      
              It returns a tuple of buckets and histogram counts.
      
              >>> rdd = sc.parallelize(range(51))
              >>> rdd.histogram(2)
              ([0, 25, 50], [25, 26])
              >>> rdd.histogram([0, 5, 25, 50])
              ([0, 5, 25, 50], [5, 20, 26])
              >>> rdd.histogram([0, 15, 30, 45, 60], True)
              ([0, 15, 30, 45, 60], [15, 15, 15, 6])
              >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
              >>> rdd.histogram(("a", "b", "c"))
              (('a', 'b', 'c'), [2, 2])
      
      Closes #122, which is a duplicate.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2091 from davies/histgram and squashes the following commits:
      
      a322f8a [Davies Liu] fix deprecation of e.message
      84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
      d9a0722 [Davies Liu] address comments
      0e18a2d [Davies Liu] add histgram() API
      3cedc4f4
    • [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext · 8856c3d8
      chutium authored
      There are 4 different compression codecs available for ```ParquetOutputFormat```.
      
      In Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression```.
      
      Original discussion:
      https://github.com/apache/spark/pull/195#discussion-diff-11002083
      
      I added a new config property in SQLConf to allow users to change this compression codec, and I used a similar short-name syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).
      
      By the way, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).
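      For illustration, a hedged PySpark sketch of setting such a codec; the property name `spark.sql.parquet.compression.codec` and the accepted short names (snappy, gzip, lzo, uncompressed) are assumptions based on the discussion above, and the 1.x-era `inferSchema`/`saveAsParquetFile` calls are used only as an example write path.
      
      ```
      # Hedged sketch: choose the Parquet compression codec via SQLConf before
      # writing a Parquet file. Property name and values are assumed, not
      # confirmed by this listing.
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Row
      
      sc = SparkContext(appName="parquet-codec-sketch")
      sqlContext = SQLContext(sc)
      
      sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")
      
      people = sc.parallelize([Row(name="alice", age=30), Row(name="bob", age=25)])
      schemaRdd = sqlContext.inferSchema(people)
      schemaRdd.saveAsParquetFile("/tmp/people.parquet")  # written with the chosen codec
      sc.stop()
      ```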
      
      Author: chutium <teng.qiu@gmail.com>
      
      Closes #2039 from chutium/parquet-compression and squashes the following commits:
      
      2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
      e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
      21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
      8856c3d8
    • [SPARK-2886] Use more specific actor system name than "spark" · b21ae5bb
      Andrew Or authored
      As of #1777 we log the name of the actor system when it binds to a port. The current name "spark" is super general and does not convey any meaning. For instance, the following line is taken from my driver log after setting `spark.driver.port` to 5001.
      ```
      14/08/13 19:33:29 INFO Remoting: Remoting started; listening on addresses:
      [akka.tcp://sparkandrews-mbp:5001]
      14/08/13 19:33:29 INFO Remoting: Remoting now listens on addresses:
      [akka.tcp://sparkandrews-mbp:5001]
      14/08/06 13:40:05 INFO Utils: Successfully started service 'spark' on port 5001.
      ```
      This commit renames this to "sparkDriver" and "sparkExecutor". The goal of this unambitious PR is simply to make the logged information more explicit without introducing any change in functionality.
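      For context, a small sketch of pinning the driver port whose service name appears in those log lines; `spark.driver.port` comes from the description above, everything else is illustrative.
      
      ```
      # Minimal sketch: fix the driver port so the driver-side actor system
      # (now named "sparkDriver") binds to a known port, as in the log excerpt.
      from pyspark import SparkConf, SparkContext
      
      conf = (SparkConf()
              .setAppName("driver-port-sketch")
              .set("spark.driver.port", "5001"))
      sc = SparkContext(conf=conf)
      print(conf.get("spark.driver.port"))
      sc.stop()
      ```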
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1810 from andrewor14/service-name and squashes the following commits:
      
      8c459ed [Andrew Or] Use a common variable for driver/executor actor system names
      3a92843 [Andrew Or] Change actor name to sparkDriver and sparkExecutor
      921363e [Andrew Or] Merge branch 'master' of github.com:apache/spark into service-name
      c8c6a62 [Andrew Or] Do not include hyphens in actor name
      1c1b42e [Andrew Or] Avoid spaces in akka system name
      f644b55 [Andrew Or] Use more specific service name
      b21ae5bb
    • [Spark-3222] [SQL] Cross join support in HiveQL · 52fbdc2d
      Daoyuan Wang authored
      We can simply treat a cross join as an inner join with no join conditions.
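      As a hedged illustration from PySpark's HiveContext (the table names are hypothetical, and the query only shows that a HiveQL CROSS JOIN is accepted and behaves like an inner join without a condition):
      
      ```
      # Sketch only: assumes Hive tables `colors` and `sizes` already exist.
      from pyspark import SparkContext
      from pyspark.sql import HiveContext
      
      sc = SparkContext(appName="cross-join-sketch")
      hive = HiveContext(sc)
      
      result = hive.sql("SELECT c.name, s.label FROM colors c CROSS JOIN sizes s")
      print(result.collect())
      sc.stop()
      ```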
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: adrian-wang <daoyuanwong@gmail.com>
      
      Closes #2124 from adrian-wang/crossjoin and squashes the following commits:
      
      8c9b7c5 [Daoyuan Wang] add a test
      7d47bbb [adrian-wang] add cross join support for hql
      52fbdc2d
  2. Aug 25, 2014
    • [SPARK-2976] Replace tabs with spaces · 62f5009f
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1895 from sarutak/SPARK-2976 and squashes the following commits:
      
      1cf7e69 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2976
      d1e0666 [Kousuke Saruta] Modified styles
      c5e80a4 [Kousuke Saruta] Remove tab from JavaPageRank.java and JavaKinesisWordCountASL.java
      c003b36 [Kousuke Saruta] Removed tab from sorttable.js
      62f5009f
    • SPARK-2481: The environment variable SPARK_HISTORY_OPTS is covered in spark-env.sh · 9f04db17
      witgo authored
      Author: witgo <witgo@qq.com>
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1341 from witgo/history_env and squashes the following commits:
      
      b4fd9f8 [GuoQiang Li] review commit
      0ebe401 [witgo] *-history-server.sh load spark-config.sh
      9f04db17
    • [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile · 4243bb66
      Chia-Yung Su authored
      Fix a compile error on Hadoop 0.23 for pull request #1924.
      
      Author: Chia-Yung Su <chiayung@appier.com>
      
      Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits:
      
      be30793 [Chia-Yung Su] remove .* and _* except _metadata
      8fe2398 [Chia-Yung Su] add note to explain
      40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
      c7e44f2 [Chia-Yung Su] match syntax
      f8fc32a [Chia-Yung Su] filter out tmp dir
      4243bb66
    • [SQL] logWarning should be logInfo in getResultSetSchema · 507a1b52
      wangfei authored
      Author: wangfei <wangfei_hello@126.com>
      
      Closes #1939 from scwf/patch-5 and squashes the following commits:
      
      f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema
      507a1b52
    • [SPARK-3058] [SQL] Support EXTENDED for EXPLAIN · 156eb396
      Cheng Hao authored
      Provide `extended` keyword support for the `explain` command in SQL, e.g.:
      ```
      explain extended select key as a1, value as a2 from src where key=1;
      == Parsed Logical Plan ==
      Project ['key AS a1#3,'value AS a2#4]
       Filter ('key = 1)
        UnresolvedRelation None, src, None
      
      == Analyzed Logical Plan ==
      Project [key#8 AS a1#3,value#9 AS a2#4]
       Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
        MetastoreRelation default, src, None
      
      == Optimized Logical Plan ==
      Project [key#8 AS a1#3,value#9 AS a2#4]
       Filter (CAST(key#8, DoubleType) = 1.0)
        MetastoreRelation default, src, None
      
      == Physical Plan ==
      Project [key#8 AS a1#3,value#9 AS a2#4]
       Filter (CAST(key#8, DoubleType) = 1.0)
        HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
      
      Code Generation: false
      == RDD ==
      (2) MappedRDD[14] at map at HiveContext.scala:350
        MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
        MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
        MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
        MappedRDD[10] at map at TableReader.scala:240
        HadoopRDD[9] at HadoopRDD at TableReader.scala:230
      ```
      
      It's a sub-task of #1847, but it can go in without any dependency.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1962 from chenghao-intel/explain_extended and squashes the following commits:
      
      295db74 [Cheng Hao] Fix bug in printing the simple execution plan
      48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
      156eb396
    • [SPARK-2929][SQL] Refactored Thrift server and CLI suites · cae9414d
      Cheng Lian authored
      Removed most hard-coded timeouts, timing assumptions, and all `Thread.sleep` calls. Simplified IPC and synchronization with `scala.sys.process` and future/promise so that the test suites run more robustly and faster.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1856 from liancheng/thriftserver-tests and squashes the following commits:
      
      2d914ca [Cheng Lian] Minor refactoring
      0e12e71 [Cheng Lian] Cleaned up test output
      0ee921d [Cheng Lian] Refactored Thrift server and CLI suites
      cae9414d
    • [SPARK-3204][SQL] MaxOf would be foldable if both left and right are foldable. · d299e2bf
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #2116 from ueshin/issues/SPARK-3204 and squashes the following commits:
      
      7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are foldable.
      d299e2bf
    • Fixed a typo in docs/running-on-mesos.md · 805fec84
      Cheng Lian authored
      It should be `spark-env.sh` rather than `spark.env.sh`.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2119 from liancheng/fix-mesos-doc and squashes the following commits:
      
      f360548 [Cheng Lian] Fixed a typo in docs/running-on-mesos.md
      805fec84
    • [FIX] fix error message in sendMessageReliably · fd8ace2d
      Xiangrui Meng authored
      rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2120 from mengxr/sendMessageReliably and squashes the following commits:
      
      b14400c [Xiangrui Meng] fix error message in sendMessageReliably
      fd8ace2d
    • SPARK-3180 - Better control of security groups · cc40a709
      Allan Douglas R. de Oliveira authored
      Adds the --authorized-address and --additional-security-group options as explained in the issue.
      
      Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>
      
      Closes #2088 from douglaz/configurable_sg and squashes the following commits:
      
      e3e48ca [Allan Douglas R. de Oliveira] Adds the option to specify the address authorized to access the SG and another option to provide an additional existing SG
      cc40a709
    • SPARK-2798 [BUILD] Correct several small errors in Flume module pom.xml files · cd30db56
      Sean Owen authored
      (EDIT) Since the scalatest issue has been resolved, this is now about a few small problems in the Flume Sink `pom.xml`:
      
      - `scalatest` is not declared as a test-scope dependency
      - Its Avro version doesn't match the rest of the build
      - Its Flume version is not synced with the other Flume module
      - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version
      - It depends on Scala Lang directly, which it shouldn't
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #1726 from srowen/SPARK-2798 and squashes the following commits:
      
      a46e2c6 [Sean Owen] scalatest to test scope, harmonize Avro and Flume versions, remove direct Scala dependency, fix '2.10' in Flume dependency
      cd30db56
    • [SPARK-2495][MLLIB] make KMeans constructor public · 220f4136
      Xiangrui Meng authored
      to re-construct k-means models freeman-lab
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2112 from mengxr/public-constructors and squashes the following commits:
      
      18d53a9 [Xiangrui Meng] make KMeans constructor public
      220f4136
  3. Aug 24, 2014
    • [SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId() · fb0db772
      Davies Liu authored
      RDD.zipWithIndex()
      
              Zips this RDD with its element indices.
      
              The ordering is first based on the partition index and then the
              ordering of items within each partition. So the first item in
              the first partition gets index 0, and the last item in the last
              partition receives the largest index.
      
              This method needs to trigger a Spark job when this RDD contains
              more than one partition.
      
              >>> sc.parallelize(range(4), 2).zipWithIndex().collect()
              [(0, 0), (1, 1), (2, 2), (3, 3)]
      
      RDD.zipWithUniqueId()
      
              Zips this RDD with generated unique Long ids.
      
              Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
              n is the number of partitions. So there may exist gaps, but this
              method won't trigger a Spark job, unlike L{zipWithIndex}.
      
              >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
              [(0, 0), (2, 1), (1, 2), (3, 3)]
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2092 from davies/zipWith and squashes the following commits:
      
      cebe5bf [Davies Liu] improve test cases, reverse the order of index
      0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
      fb0db772
    • [MLlib][SPARK-2997] Update SVD documentation to reflect roughly square · b1b20301
      Reza Zadeh authored
      Update the documentation to reflect the fact that we can handle roughly square matrices.
      
      Author: Reza Zadeh <rizlar@gmail.com>
      
      Closes #2070 from rezazadeh/svddocs and squashes the following commits:
      
      826b8fe [Reza Zadeh] left singular vectors
      3f34fc6 [Reza Zadeh] PCA is still TS
      7ffa2aa [Reza Zadeh] better title
      aeaf39d [Reza Zadeh] More docs
      788ed13 [Reza Zadeh] add computational cost explanation
      6429c59 [Reza Zadeh] Add link to rowmatrix docs
      1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square
      b1b20301
    • [SPARK-2841][MLlib] Documentation for feature transformations · 572952ae
      DB Tsai authored
      Documentation for newly added feature transformations:
      1. TF-IDF
      2. StandardScaler
      3. Normalizer
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #2068 from dbtsai/transformer-documentation and squashes the following commits:
      
      109f324 [DB Tsai] address feedback
      572952ae
    • [SPARK-3192] Some scripts have 2 space indentation but other scripts have 4 space indentation. · ded6796b
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2104 from sarutak/SPARK-3192 and squashes the following commits:
      
      db78419 [Kousuke Saruta] Modified indentation of spark-shell
      ded6796b
  4. Aug 23, 2014
    • Clean unused code in SortShuffleWriter · 8861cdf1
      Raymond Liu authored
      Just clean up unused code that has already been moved into ExternalSorter.
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1882 from colorant/sortShuffleWriter and squashes the following commits:
      
      e6337be [Raymond Liu] Clean unused code in SortShuffleWriter
      8861cdf1
    • [SPARK-2871] [PySpark] add approx API for RDD · 8df4dad4
      Davies Liu authored
      RDD.countApprox(self, timeout, confidence=0.95)
      
              :: Experimental ::
              Approximate version of count() that returns a potentially incomplete
              result within a timeout, even if not all tasks have finished.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> rdd.countApprox(1000, 1.0)
              1000
      
      RDD.sumApprox(self, timeout, confidence=0.95)
      
              Approximate operation to return the sum within a timeout
              or meet the confidence.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> r = sum(xrange(1000))
              >>> (rdd.sumApprox(1000) - r) / r < 0.05
              True
      
      RDD.meanApprox(self, timeout, confidence=0.95)
      
              :: Experimental ::
              Approximate operation to return the mean within a timeout
              or meet the confidence.
      
              >>> rdd = sc.parallelize(range(1000), 10)
              >>> r = sum(xrange(1000)) / 1000.0
              >>> (rdd.meanApprox(1000) - r) / r < 0.05
              True
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2095 from davies/approx and squashes the following commits:
      
      e8c252b [Davies Liu] add approx API for RDD
      8df4dad4
    • [SPARK-2871] [PySpark] add `key` argument for max(), min() and top(n) · db436e36
      Davies Liu authored
      RDD.max(key=None)
      
              param key: A function used to generate the key for comparison
      
              >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
              >>> rdd.max()
              43.0
              >>> rdd.max(key=str)
              5.0
      
      RDD.min(key=None)
      
              Find the minimum item in this RDD.
      
              param key: A function used to generate the key for comparison
      
              >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
              >>> rdd.min()
              2.0
              >>> rdd.min(key=str)
              10.0
      
      RDD.top(num, key=None)
      
              Get the top N elements from a RDD.
      
              Note: It returns the list sorted in descending order.
              >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
              [12]
              >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
              [6, 5]
              >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
              [4, 3, 2]
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2094 from davies/cmp and squashes the following commits:
      
      ccbaf25 [Davies Liu] add `key` to top()
      ad7e374 [Davies Liu] fix tests
      2f63512 [Davies Liu] change `comp` to `key` in min/max
      dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
      db436e36
    • [SPARK-2967][SQL] Follow-up: Also copy hash expressions in sort based shuffle fix. · 3519b5e8
      Michael Armbrust authored
      Follow-up to #2066
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2072 from marmbrus/sortShuffle and squashes the following commits:
      
      2ff8114 [Michael Armbrust] Fix bug
      3519b5e8
    • [SPARK-2554][SQL] CountDistinct partial aggregation and object allocation improvements · 7e191fe2
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      Author: Gregory Owen <greowen@gmail.com>
      
      Closes #1935 from marmbrus/countDistinctPartial and squashes the following commits:
      
      5c7848d [Michael Armbrust] turn off caching in the constructor
      8074a80 [Michael Armbrust] fix tests
      32d216f [Michael Armbrust] reynolds comments
      c122cca [Michael Armbrust] Address comments, add tests
      b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
      fae38f4 [Michael Armbrust] Fix style
      fdca896 [Michael Armbrust] cleanup
      93d0f64 [Michael Armbrust] metastore concurrency fix.
      db44a30 [Michael Armbrust] JIT hax.
      3868f6c [Michael Armbrust] Merge pull request #9 from GregOwen/countDistinctPartial
      c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
      2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
      8ff6402 [Michael Armbrust] Add specific row.
      58d15f1 [Michael Armbrust] disable codegen logging
      87d101d [Michael Armbrust] Fix isNullAt bug
      abee26d [Michael Armbrust] WIP
      27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
      57ae3b1 [Michael Armbrust] Fix order dependent test
      b3d0f64 [Michael Armbrust] Add golden files.
      c1f7114 [Michael Armbrust] Improve tests / fix serialization.
      f31b8ad [Michael Armbrust] more fixes
      38c7449 [Michael Armbrust] comments and style
      9153652 [Michael Armbrust] better toString
      d494598 [Michael Armbrust] Fix tests now that the planner is better
      41fbd1d [Michael Armbrust] Never try and create an empty hash set.
      050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
      bd08239 [Michael Armbrust] WIP
      213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max
      7e191fe2
    • [SQL] Make functionRegistry in HiveContext transient. · 2fb1c72e
      Yin Huai authored
      Seems we missed `transient` for the `functionRegistry` in `HiveContext`.
      
      cc: marmbrus
      
      Author: Yin Huai <huaiyin.thu@gmail.com>
      
      Closes #2074 from yhuai/makeFunctionRegistryTransient and squashes the following commits:
      
      6534e7d [Yin Huai] Make functionRegistry transient.
      2fb1c72e
    • [Minor] fix typo · 76bb044b
      Liang-Chi Hsieh authored
      Fix a typo in comment.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #2105 from viirya/fix_typo and squashes the following commits:
      
      6596a80 [Liang-Chi Hsieh] fix typo.
      76bb044b
    • [SPARK-3068] remove MaxPermSize option for jvm 1.8 · f3d65cd0
      Daoyuan Wang authored
      In JVM 1.8.0, MaxPermSize is no longer supported.
      In Spark's `stderr` output, there would be a line like:
      
          Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2011 from adrian-wang/maxpermsize and squashes the following commits:
      
      ef1d660 [Daoyuan Wang] direct get java version in runtime
      37db9c1 [Daoyuan Wang] code refine
      3c1d554 [Daoyuan Wang] remove MaxPermSize option for jvm 1.8
      f3d65cd0
    • [SPARK-2963] REGRESSION - The description about how to build for using CLI and... · 323cd92b
      Kousuke Saruta authored
      [SPARK-2963] REGRESSION - The description of how to build for using the CLI and Thrift JDBC server is absent from the proper document.
      
      The most important points I made in #1885 are as follows:
      
      * People who build Spark are not always programmers.
      * If a person who builds Spark is not a programmer, he/she won't read the programmer's guide before building.
      
      So, how to build for using the CLI and JDBC server should not live only in the programmer's guide.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2080 from sarutak/SPARK-2963 and squashes the following commits:
      
      ee07c76 [Kousuke Saruta] Modified regression of the description about building for using Thrift JDBC server and CLI
      ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
      07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
      6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
      c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
      323cd92b
  5. Aug 22, 2014
    • [SPARK-3169] Removed dependency on spark streaming test from spark flume sink · 30040741
      Tathagata Das authored
      Due to Maven bug https://jira.codehaus.org/browse/MNG-1378, Maven could not resolve the Spark Streaming classes required by the spark-streaming test-jar dependency of external/flume-sink. There is no particular reason that external/flume-sink has to depend on Spark Streaming at all, so I am eliminating this dependency. Also, I have removed the exclusions present in the Flume dependencies, as there is no reason to exclude them (they were excluded in the external/flume module to prevent dependency collisions with Spark).
      
      Since Jenkins will test the sbt build and the unit tests, I only tested Maven compilation locally.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2101 from tdas/spark-sink-pom-fix and squashes the following commits:
      
      8f42621 [Tathagata Das] Added Flume sink exclusions back, and added netty to test dependencies
      93b559f [Tathagata Das] Removed dependency on spark streaming test from spark flume sink
      30040741
    • Reynold Xin · a5219db1
    • [SPARK-2742][yarn] delete useless variables · 220c2d76
      XuTingjun authored
      Author: XuTingjun <1039320815@qq.com>
      
      Closes #1614 from XuTingjun/yarn-bug and squashes the following commits:
      
      f07096e [XuTingjun] Update ClientArguments.scala
      220c2d76
  6. Aug 21, 2014
    • [SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples) · 050f8d01
      Joseph K. Bradley authored
      Updated DecisionTree documentation with examples for Java and Python.
      Added the same Java example to the code as well.
      CC: @mengxr  @manishamde @atalwalkar
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2063 from jkbradley/dt-docs and squashes the following commits:
      
      2dd2c19 [Joseph K. Bradley] Last updates based on github review.
      9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
      d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
      b9bee04 [Joseph K. Bradley] Updated DT examples
      57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
      d939a92 [Joseph K. Bradley] Updated DecisionTree documentation.  Added Java, Python examples.
      050f8d01
  7. Aug 20, 2014