  1. Jun 05, 2014
    • Patrick Wendell's avatar
      HOTFIX: Remove generated-mima-excludes file after running MIMA. · f6143f12
      Patrick Wendell authored
      This has been causing some false failures on PRs that don't merge
      correctly.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #971 from pwendell/mima and squashes the following commits:
      
      1dc80aa [Patrick Wendell] HOTFIX: Remove generated-mima-excludes file after running MIMA.
      f6143f12
    • Takuya UESHIN's avatar
      [SPARK-2036] [SQL] CaseConversionExpression should check if the evaluated value is null. · e4c11eef
      Takuya UESHIN authored
      `CaseConversionExpression` should check if the evaluated value is `null`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #982 from ueshin/issues/SPARK-2036 and squashes the following commits:
      
      61e1c54 [Takuya UESHIN] Add check if the evaluated value is null.
      e4c11eef
    • CodingCat's avatar
      SPARK-1677: allow user to disable output dir existence checking · 89cdbb08
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-1677
      
      For compatibility with older versions of Spark, it would be useful to have an option, `spark.hadoop.validateOutputSpecs` (default `true`), that lets the user disable the check for whether the output directory already exists.
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #947 from CodingCat/SPARK-1677 and squashes the following commits:
      
      7930f83 [CodingCat] miao
      c0c0e03 [CodingCat] bug fix and doc update
      5318562 [CodingCat] bug fix
      13219b5 [CodingCat] allow user to disable output dir existence checking
      89cdbb08
    • Takuya UESHIN's avatar
      [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT. · 7c160293
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits:
      
      e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
      7c160293
    • Marcelo Vanzin's avatar
      Fix issue in ReplSuite with hadoop-provided profile. · b77c19be
      Marcelo Vanzin authored
      When building the assembly with the maven "hadoop-provided"
      profile, the executors were failing to come up because Hadoop classes
      were not found in the classpath anymore; so add them explicitly to
      the classpath using spark.executor.extraClassPath. This is only
      needed for the local-cluster mode, but doesn't affect other tests,
      so it's added for all of them to keep the code simpler.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #781 from vanzin/repl-test-fix and squashes the following commits:
      
      4f0a3b0 [Marcelo Vanzin] Fix issue in ReplSuite with hadoop-provided profile.
      b77c19be
  2. Jun 04, 2014
    • Ankur Dave's avatar
      Minor: Fix documentation error from apache/spark#946 · abea2d4f
      Ankur Dave authored
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #970 from ankurdave/SPARK-1991_docfix and squashes the following commits:
      
      6d07343 [Ankur Dave] Minor: Fix documentation error from apache/spark#946
      abea2d4f
    • Varakhedi Sujeet's avatar
      SPARK-1790: Update EC2 scripts to support r3 instance types · 11ded3f6
      Varakhedi Sujeet authored
      Author: Varakhedi Sujeet <svarakhedi@gopivotal.com>
      
      Closes #960 from sujeetv/ec2-r3 and squashes the following commits:
      
      3cb9fd5 [Varakhedi Sujeet] SPARK-1790: Update EC2 scripts to support r3 instance
      11ded3f6
    • Colin McCabe's avatar
      SPARK-1518: FileLogger: Fix compile against Hadoop trunk · 1765c8d0
      Colin McCabe authored
      In Hadoop trunk (currently Hadoop 3.0.0), the deprecated
      FSDataOutputStream#sync() method has been removed.  Instead, we should
      call FSDataOutputStream#hflush, which does the same thing as the
      deprecated method used to do.
      
      Author: Colin McCabe <cmccabe@cloudera.com>
      
      Closes #898 from cmccabe/SPARK-1518 and squashes the following commits:
      
      752b9d7 [Colin McCabe] FileLogger: Fix compile against Hadoop trunk
      1765c8d0
    • Xiangrui Meng's avatar
      [SPARK-1752][MLLIB] Standardize text format for vectors and labeled points · 189df165
      Xiangrui Meng authored
      We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following:
      
      1. dense vector: `[v0,v1,..]`
      2. sparse vector: `(size,[i0,i1],[v0,v1])`
      3. labeled point: `(label,vector)`
      
      where "(..)" indicates a tuple and "[...]" indicates an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically.
      
      `MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`.
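To make the three proposed text formats concrete, here is a minimal Python sketch of a reader for them. It is not the MLlib or pyspark parser from this patch; the function names and the dictionary layout are assumptions made for illustration, and it relies on `ast.literal_eval` from the standard library rather than a custom tokenizer.

```python
import ast


def _to_vector(value):
    """Turn an already-evaluated literal into a dense or sparse vector record."""
    if isinstance(value, list):
        # Dense vector: [v0,v1,..]
        return {"kind": "dense", "values": [float(v) for v in value]}
    if isinstance(value, tuple) and len(value) == 3:
        # Sparse vector: (size,[i0,i1],[v0,v1])
        size, indices, values = value
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        return {
            "kind": "sparse",
            "size": int(size),
            "indices": [int(i) for i in indices],
            "values": [float(v) for v in values],
        }
    raise ValueError("not a vector: %r" % (value,))


def parse_vector(text):
    """Parse a dense or sparse vector from its text form."""
    return _to_vector(ast.literal_eval(text.strip()))


def parse_labeled_point(text):
    """Parse a labeled point written as (label,vector)."""
    value = ast.literal_eval(text.strip())
    if not (isinstance(value, tuple) and len(value) == 2):
        raise ValueError("not a labeled point: %r" % (text,))
    label, vector = value
    return float(label), _to_vector(vector)
```

A tuple of three elements is read as a sparse vector and a tuple of two elements as a labeled point, which is how the sketch tells the two "(..)" forms apart.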
      
      CC: @mateiz, @srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #685 from mengxr/labeled-io and squashes the following commits:
      
      2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1
      297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility
      d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io
      56746ea [Xiangrui Meng] replace # by .
      623a5f0 [Xiangrui Meng] merge master
      f06d5ba [Xiangrui Meng] add docs and minor updates
      640fe0c [Xiangrui Meng] throw SparkException
      5bcfbc4 [Xiangrui Meng] update test to add scientific notations
      e86bf38 [Xiangrui Meng] remove NumericTokenizer
      050fca4 [Xiangrui Meng] use StringTokenizer
      6155b75 [Xiangrui Meng] merge master
      f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark
      a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation
      ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests
      e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests
      aea4ae3 [Xiangrui Meng] minor updates
      810d6df [Xiangrui Meng] update tokenizer/parser implementation
      7aac03a [Xiangrui Meng] remove Scala parsers
      c1885c1 [Xiangrui Meng] add headers and minor changes
      b0c50cb [Xiangrui Meng] add customized parser
      d731817 [Xiangrui Meng] style update
      63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark
      ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io
      cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint
      a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors
      5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors
      7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__
      e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData
      9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints
      19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint
      189df165
    • Sean Owen's avatar
      SPARK-1973. Add randomSplit to JavaRDD (with tests, and tidy Java tests) · d341b17c
      Sean Owen authored
      I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method? and that sometimes users should just use JavaRDD.wrapRDD()?)
      
      Along the way, I added tests for it, and also touched up the Java API test style and behavior. This is maybe the more useful part of this small change.
      
      Author: Sean Owen <sowen@cloudera.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Xiangrui Meng <meng@databricks.com>
      
      Closes #919 from srowen/SPARK-1973 and squashes the following commits:
      
      148cb7b [Sean Owen] Some final Java test polish, while we are at it
      1fc3f3e [Xiangrui Meng] more cleaning on Java 8 tests
      9ebc57f [Sean Owen] Use accumulator instead of temp files to test foreach
      5efb0be [Sean Owen] Add Java randomSplit, and unit tests (including for sample)
      5dcc158 [Sean Owen] Simplified Java 8 test with new language features, and fixed the name of MLB's greatest team
      91a1769 [Sean Owen] Touch up minor style issues in existing Java API suite test
      d341b17c
    • Neville Li's avatar
      [MLLIB] set RDD names in ALS · b8d25800
      Neville Li authored
      This is very useful when debugging & fine tuning jobs with large data sets.
      
      Author: Neville Li <neville@spotify.com>
      
      Closes #966 from nevillelyh/master and squashes the following commits:
      
      6747764 [Neville Li] [MLLIB] use string interpolation for RDD names
      3b15d34 [Neville Li] [MLLIB] set RDD names in ALS
      b8d25800
    • Kan Zhang's avatar
      [SPARK-1817] RDD.zip() should verify partition sizes for each partition · c402a4a6
      Kan Zhang authored
      RDD.zip() will throw an exception if it finds partition sizes are not the same.
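The per-partition check can be pictured with a small Python sketch. This is an illustration of the idea only, not the Scala code in this patch; `zip_partition` and its error message are hypothetical.

```python
def zip_partition(left, right):
    """Pair up elements of two partitions, failing if their sizes differ."""
    left_iter, right_iter = iter(left), iter(right)
    sentinel = object()  # marks exhaustion without clashing with real values
    while True:
        a = next(left_iter, sentinel)
        b = next(right_iter, sentinel)
        if a is sentinel and b is sentinel:
            return  # both partitions ended together: sizes match
        if a is sentinel or b is sentinel:
            raise ValueError(
                "Can only zip partitions with the same number of elements"
            )
        yield (a, b)
```

Because the check happens while iterating, a mismatch is only detected once the shorter partition runs out.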
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #944 from kanzhang/SPARK-1817 and squashes the following commits:
      
      c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates
      524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition
      c402a4a6
    • Sean Owen's avatar
      SPARK-1806 (addendum) Use non-deprecated methods in Mesos 0.18 · 4ca06256
      Sean Owen authored
      The update to Mesos 0.18 caused some deprecation warnings in the build. The change to the non-deprecated version is straightforward as it emulates what the Mesos driver does with the deprecated method anyway (https://github.com/apache/mesos/blob/c5aa1dd22155d79c5a7c33076319299a40fd63b3/src/sched/sched.cpp#L1354)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #920 from srowen/SPARK-1806 and squashes the following commits:
      
      8d76b6a [Sean Owen] Use non-deprecated methods in Mesos 0.18
      4ca06256
    • Aaron Davidson's avatar
      Update spark-ec2 scripts for 1.0.0 on master · ab7c62d5
      Aaron Davidson authored
      The change was previously committed only to branch-1.0 as part of https://github.com/apache/spark/commit/a34e6fda1d6fb8e769c21db70845f1a6dde968d8
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Patrick Wendell <pwendell@gmail.com>
      
      Closes #938 from aarondav/sparkec2 and squashes the following commits:
      
      067cc31 [Aaron Davidson] Update spark-ec2 scripts for 1.0.0 on master
      ab7c62d5
  3. Jun 03, 2014
    • Joseph E. Gonzalez's avatar
      Enable repartitioning of graph over different number of partitions · 5284ca78
      Joseph E. Gonzalez authored
      It is currently very difficult to repartition a graph over a different number of partitions.  This PR adds an additional `partitionBy` function that takes the number of partitions.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #719 from jegonzal/graph_partitioning_options and squashes the following commits:
      
      730b405 [Joseph E. Gonzalez] adding an additional number of partitions option to partitionBy
      5284ca78
    • Xiangrui Meng's avatar
      use env default python in merge_spark_pr.py · e8d93ee5
      Xiangrui Meng authored
      A minor change to use env default python instead of fixed `/usr/bin/python`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #965 from mengxr/merge-pr-python and squashes the following commits:
      
      1ae0013 [Xiangrui Meng] use env default python in merge_spark_pr.py
      e8d93ee5
    • Reynold Xin's avatar
      SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. · 1faef149
      Reynold Xin authored
      I also corrected some errors in the previous HLL approximate count API, including the fact that relativeSD wasn't really a measure of error (even though we used it to test error bounds in test results).
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #897 from rxin/hll and squashes the following commits:
      
      4d83f41 [Reynold Xin] New error bound and non-randomness.
      f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
      e367527 [Reynold Xin] One more round of code review.
      41e649a [Reynold Xin] Update final mima list.
      9e320c8 [Reynold Xin] Incorporate code review feedback.
      e110d70 [Reynold Xin] Merge branch 'master' into hll
      354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
      acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
      6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
      1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
      9221b27 [Reynold Xin] Merge branch 'master' into hll
      88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
      1294be6 [Reynold Xin] Updated HLL+.
      e7786cb [Reynold Xin] Merge branch 'master' into hll
      c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
      1faef149
    • Kan Zhang's avatar
      [SPARK-1161] Add saveAsPickleFile and SparkContext.pickleFile in Python · 21e40ed8
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #755 from kanzhang/SPARK-1161 and squashes the following commits:
      
      24ed8a2 [Kan Zhang] [SPARK-1161] Fixing doc tests
      44e0615 [Kan Zhang] [SPARK-1161] Adding an optional batchSize with default value 10
      d929429 [Kan Zhang] [SPARK-1161] Add saveAsObjectFile and SparkContext.objectFile in Python
      21e40ed8
    • DB Tsai's avatar
      Fixed a typo · f4dd665c
      DB Tsai authored
      in RowMatrix.scala
      
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #959 from dbtsai/dbtsai-typo and squashes the following commits:
      
      fab0e0e [DB Tsai] Fixed typo
      f4dd665c
    • Ankur Dave's avatar
      [SPARK-1991] Support custom storage levels for vertices and edges · b1feb602
      Ankur Dave authored
      This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed.
      
      The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels.
      
      In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods.
      
      I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #946 from ankurdave/SPARK-1991 and squashes the following commits:
      
      ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString
      ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores
      c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
      c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks"
      34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks
      6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
      b1feb602
    • Joseph E. Gonzalez's avatar
      Synthetic GraphX Benchmark · 894ecde0
      Joseph E. Gonzalez authored
      This PR accomplishes two things:
      
      1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph.  This can be used to profile the GraphX system on arbitrary clusters without access to large graph datasets.
      
      2. This PR improves the implementation of the log-normal graph generator.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:
      
      e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
      bccccad [Ankur Dave] Fix long lines
      374678a [Ankur Dave] Bugfix and style changes
      1bdf39a [Joseph E. Gonzalez] updating options
      d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
      f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
      894ecde0
    • baishuo(白硕)'s avatar
      fix java.lang.ClassCastException · aa41a522
      baishuo(白硕) authored
      Running `bin/run-example org.apache.spark.examples.sql.RDDRelation` throws the following exception:
      Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
      	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
      	at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
      	at org.apache.spark.examples.sql.RDDRelation$.main(RDDRelation.scala:49)
      	at org.apache.spark.examples.sql.RDDRelation.main(RDDRelation.scala)
      Changing `sql("SELECT COUNT(*) FROM records").collect().head.getInt(0)` to `sql("SELECT COUNT(*) FROM records").collect().head.getLong(0)` makes the exception go away.
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #949 from baishuo/master and squashes the following commits:
      
      f4b319f [baishuo(白硕)] fix java.lang.ClassCastException
      aa41a522
    • Erik Selin's avatar
      [SPARK-1468] Modify the partition function used by partitionBy. · 8edc9d03
      Erik Selin authored
      Make partitionBy use a tweaked version of hash as its default partition function
      since the python hash function does not consistently assign the same value
      to None across python processes.
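A minimal sketch of what such a tweaked hash could look like follows. It is an assumption made for illustration, not the function added by this patch: `portable_hash` and `partition_for` are hypothetical names, and the tuple mixing constants are arbitrary. Note that on Python 3 string hashes are also randomized per process unless `PYTHONHASHSEED` is fixed, which this sketch does not address.

```python
def portable_hash(key):
    """Hash that gives None a fixed value, including inside tuples."""
    if key is None:
        return 0  # fixed value instead of a process-dependent hash(None)
    if isinstance(key, tuple):
        h = 0x345678
        for item in key:
            h ^= portable_hash(item)
            h *= 1000003
            h &= 0xFFFFFFFF  # keep the running value bounded
        h ^= len(key)
        return h
    return hash(key)


def partition_for(key, num_partitions):
    """Map a key to a partition index in [0, num_partitions)."""
    return portable_hash(key) % num_partitions
```

With this, a `None` key always lands in partition 0, whichever worker process computes it.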
      
      Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468
      
      Author: Erik Selin <erik.selin@jadedpixel.com>
      
      Closes #371 from tyro89/consistent_hashing and squashes the following commits:
      
      201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.
      8edc9d03
    • tzolov's avatar
      Add support for Pivotal HD in the Maven build: SPARK-1992 · b1f28535
      tzolov authored
      Allow Spark to build against particular Pivotal HD distributions. For example, to build Spark against Pivotal HD 2.0.1, one can run:
      ```
      mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package
      ```
      
      Author: tzolov <christian.tzolov@gmail.com>
      
      Closes #942 from tzolov/master and squashes the following commits:
      
      bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992]
      b1f28535
    • Wenchen Fan(Cloud)'s avatar
      [SPARK-1912] fix compress memory issue during reduce · 45e9bc85
      Wenchen Fan(Cloud) authored
      When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block.
      Suppose a reducer task needs to read 1000 local shuffle blocks. It first prepares to read all 1000 blocks, which means creating 1000 compression stream instances to wrap them. Initializing each compression instance allocates some memory, so having many instances alive at the same time becomes a problem.
      The reducer actually reads the shuffle blocks one by one, so we can initialize the compression instances lazily.
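The lazy approach can be sketched in Python with generators. This is an analogy, not the Scala change itself: it uses `zlib` from the standard library in place of LZF or Snappy, and the function names are hypothetical.

```python
import zlib


def read_block(block):
    """Create a decompressor for one block and return its records."""
    decompressor = zlib.decompressobj()  # allocated only when this runs
    data = decompressor.decompress(block) + decompressor.flush()
    return iter(data.decode("utf-8").splitlines())


def lazy_block_iterators(blocks):
    """Yield one record iterator per block, creating each only on demand."""
    for block in blocks:
        yield read_block(block)


def read_all(blocks):
    """Read blocks one by one; at most one decompressor is needed at a time."""
    for records in lazy_block_iterators(blocks):
        for record in records:
            yield record
```

An eager version would call `read_block` for every block up front, which is the pattern that allocated 1000 compression instances at once.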
      
      Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com>
      
      Closes #860 from cloud-fan/fix-compress and squashes the following commits:
      
      0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator'
      07f32c22 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize
      d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class
      2c8adb2 [Wenchen Fan(Cloud)] add inline comment
      8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce
      45e9bc85
    • Henry Saputra's avatar
      SPARK-2001 : Remove docs/spark-debugger.md from master · 6c044ed1
      Henry Saputra authored
      Per discussion in dev list:
      "
      Seemed like the spark-debugger.md is no longer accurate (see
      http://spark.apache.org/docs/latest/spark-debugger.html) and since it
      was originally written Spark has evolved that makes the doc obsolete.
      There are already work pending for new replay debugging (I could not
      find the PR links for it) so I
      With version control we could always reinstate the old doc if needed,
      but as of today the doc is no longer reflect the current state of
      Spark's RDD.
      "
      
      Author: Henry Saputra <henry.saputra@gmail.com>
      
      Closes #953 from hsaputra/SPARK-2001-hsaputra and squashes the following commits:
      
      dc324aa [Henry Saputra] SPARK-2001 : Remove docs/spark-debugger.md from master since it is obsolete
      6c044ed1
    • Syed Hashmi's avatar
      [SPARK-1942] Stop clearing spark.driver.port in unit tests · 7782a304
      Syed Hashmi authored
      Stop resetting `spark.driver.port` in unit tests (Scala, Java, and Python).
      
      Author: Syed Hashmi <shashmi@cloudera.com>
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #943 from syedhashmi/master and squashes the following commits:
      
      885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
      b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
      b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
      57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
      1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
      4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
      fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
      6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
      4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
      7782a304
  4. Jun 02, 2014
    • Cheng Lian's avatar
      Avoid dynamic dispatching when unwrapping Hive data. · 862283e9
      Cheng Lian authored
      This is a follow up of PR #758.
      
      The `unwrapHiveData` function is now composed statically before actual rows are scanned according to the field object inspector to avoid dynamic dispatching cost.
      
      According to the same micro benchmark used in PR #758, this simple change brings a slight performance boost: 2.5% for the CSV table and 1% for the RCFile table.
      
      ```
      Optimized version:
      
      CSV: 6870 ms, RCFile: 5687 ms
      CSV: 6832 ms, RCFile: 5800 ms
      CSV: 6822 ms, RCFile: 5679 ms
      CSV: 6704 ms, RCFile: 5758 ms
      CSV: 6819 ms, RCFile: 5725 ms
      
      Original version:
      
      CSV: 7042 ms, RCFile: 5667 ms
      CSV: 6883 ms, RCFile: 5703 ms
      CSV: 7115 ms, RCFile: 5665 ms
      CSV: 7020 ms, RCFile: 5981 ms
      CSV: 6871 ms, RCFile: 5906 ms
      ```
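The static composition described above can be illustrated with a small Python sketch. It is only an analogy for the idea: the type names and the `compose_unwrapper` helper are hypothetical and have nothing to do with Hive's object inspectors.

```python
def compose_unwrapper(field_types):
    """Resolve one converter per field up front, then reuse it for every row."""
    converters = {"int": int, "double": float, "string": str}
    # The lookup by type happens once here, not once per value scanned.
    per_field = [converters[t] for t in field_types]

    def unwrap(raw_row):
        return [convert(value) for convert, value in zip(per_field, raw_row)]

    return unwrap
```

The function returned by `compose_unwrapper` is built before any row is scanned, so the per-row work is a plain call per field with no type dispatch.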
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #935 from liancheng/staticUnwrapping and squashes the following commits:
      
      c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.
      862283e9
    • egraldlo's avatar
      [SPARK-1995][SQL] system function upper and lower can be supported · ec8be274
      egraldlo authored
      I am not sure whether now is the right time to implement system functions for string operations in Spark SQL.
      
      Author: egraldlo <egraldlo@gmail.com>
      
      Closes #936 from egraldlo/stringoperator and squashes the following commits:
      
      3c6c60a [egraldlo] Add UPPER, LOWER, MAX and MIN into hive parser
      ea76d0a [egraldlo] modify the formatting issues
      b49f25e [egraldlo] modify the formatting issues
      1f0bbb5 [egraldlo] system function upper and lower supported
      13d3267 [egraldlo] system function upper and lower supported
      ec8be274
    • Cheng Lian's avatar
      [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on... · d000ca98
      Cheng Lian authored
      [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
      
      In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #939 from liancheng/spark-1958 and squashes the following commits:
      
      bdc4a14 [Cheng Lian] Copy rows to present immutable data to users
      8250976 [Cheng Lian] Added return type explicitly for public API
      192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
      d000ca98
    • Tor Myklebust's avatar
      [SPARK-1553] Alternating nonnegative least-squares · 9a5d482e
      Tor Myklebust authored
      This pull request includes a nonnegative least-squares solver (NNLS) tailored to the kinds of small-scale problems that come up when training matrix factorisation models by alternating nonnegative least-squares (ANNLS).
      
      The method used for the NNLS subproblems is based on the classical method of projected gradients.  There is a modification where, if the set of active constraints has not changed since the last iteration, a conjugate gradient step is considered and possibly rejected in favour of the gradient; this improves convergence once the optimal face has been located.
      
      The NNLS solver is in `org.apache.spark.mllib.optimization.NNLSbyPCG`.
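For readers unfamiliar with projected gradients, here is a minimal Python sketch of the plain method for minimizing 0.5 * ||Ax - b||^2 subject to x >= 0, working from the normal-equation matrices A^T A and A^T b. It is not `NNLSbyPCG`: it omits the conjugate gradient modification described above, and the function name, arguments, and stopping rule are assumptions.

```python
def nnls_projected_gradient(ata, atb, iterations=1000, tol=1e-12):
    """Approximately solve min 0.5*||Ax-b||^2 s.t. x >= 0, given A^T A and A^T b."""
    n = len(atb)
    x = [0.0] * n
    for _ in range(iterations):
        grad = [sum(ata[i][j] * x[j] for j in range(n)) - atb[i] for i in range(n)]
        # Project the descent direction: a variable sitting at zero must not
        # be pushed negative, so its component is dropped.
        direction = [
            0.0 if (x[i] <= 0.0 and grad[i] > 0.0) else -grad[i] for i in range(n)
        ]
        norm_sq = sum(d * d for d in direction)
        if norm_sq < tol:
            break  # projected gradient is (nearly) zero
        curvature = sum(
            direction[i] * ata[i][j] * direction[j]
            for i in range(n)
            for j in range(n)
        )
        if curvature <= 0.0:
            break
        # Exact line search for a quadratic along the projected direction.
        step = norm_sq / curvature
        # Shorten the step so that no variable crosses below zero.
        for i in range(n):
            if direction[i] < 0.0:
                step = min(step, -x[i] / direction[i])
        x = [max(0.0, x[i] + step * direction[i]) for i in range(n)]
    return x
```

For example, with A the 2x2 identity and b = (1, -1), the unconstrained solution has a negative second component, and the sketch instead settles on (1, 0).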
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      
      Closes #460 from tmyklebu/annls and squashes the following commits:
      
      79bc4b5 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into annls
      199b0bc [Tor Myklebust] Make the ctor private again and use the builder pattern.
      7fbabf1 [Tor Myklebust] Cleanup matrix math in NNLSSuite.
      65ef7f2 [Tor Myklebust] Make ALS's ctor public and remove a couple of "convenience" wrappers.
      2d4f3cb [Tor Myklebust] Cleanup.
      0cb4481 [Tor Myklebust] Drop the iteration limit from 40k to max(400,20n).
      e2a01d1 [Tor Myklebust] Create a workspace object for NNLS to cut down on memory allocations.
      b285106 [Tor Myklebust] Clean up NNLS test cases.
      9c820b6 [Tor Myklebust] Tweak variable names.
      8a1a436 [Tor Myklebust] Describe the problem and add a reference to Polyak's paper.
      5345402 [Tor Myklebust] Style fixes that got eaten.
      ac673bd [Tor Myklebust] More safeguards against numerical ridiculousness.
      c288b6a [Tor Myklebust] Finish moving the NNLS solver.
      9a82fa6 [Tor Myklebust] Fix scalastyle moanings.
      33bf4f2 [Tor Myklebust] Fix missing space.
      89ea0a8 [Tor Myklebust] Hack ALSSuite to support NNLS testing.
      f5dbf4d [Tor Myklebust] Teach ALS how to use the NNLS solver.
      6cb563c [Tor Myklebust] Tests for the nonnegative least squares solver.
      a68ac10 [Tor Myklebust] A nonnegative least-squares solver.
      9a5d482e
    • Ankur Dave's avatar
      Add landmark-based Shortest Path algorithm to graphx.lib · 9535f404
      Ankur Dave authored
      This is a modified version of apache/spark#10.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: Andres Perez <andres@tresata.com>
      
      Closes #933 from ankurdave/shortestpaths and squashes the following commits:
      
      03a103c [Ankur Dave] Style fixes
      7a1ff48 [Ankur Dave] Improve ShortestPaths documentation
      d75c8fc [Ankur Dave] Remove unnecessary VD type param, and pass through ED
      d983fb4 [Ankur Dave] Fix style errors
      60ed8e6 [Andres Perez] Add Shortest-path computations to graphx.lib with unit tests.
      9535f404
  5. Jun 01, 2014
    • Patrick Wendell's avatar
      Better explanation for how to use MIMA excludes. · d17d2214
      Patrick Wendell authored
      This patch does a few things:
      1. We have a file MimaExcludes.scala exclusively for excludes.
      2. The test runner tells users about that file if a test fails.
      3. I've added back the excludes used from 0.9->1.0. We should keep
         these in the project as an official audit trail of times where
         we decided to make exceptions.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #937 from pwendell/mima and squashes the following commits:
      
      7ee0db2 [Patrick Wendell] Better explanation for how to use MIMA excludes.
      d17d2214
    • Reynold Xin's avatar
      Made spark_ec2.py PEP8 compliant. · eea3aab4
      Reynold Xin authored
      The change set is actually pretty small -- mostly whitespace changes. Admittedly this is a scary change due to the lack of tests to cover the ec2 scripts, and also because indentation actually impacts control flow in Python ...
      
      Look at changes without whitespace diff here: https://github.com/apache/spark/pull/891/files?w=1
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #891 from rxin/spark-ec2-pep8 and squashes the following commits:
      
      ac1bf11 [Reynold Xin] Made spark_ec2.py PEP8 compliant.
      eea3aab4
  6. May 31, 2014
    • Yadid Ayzenberg's avatar
      updated java code blocks in spark SQL guide such that ctx will refer to ... · 366c0c4c
      Yadid Ayzenberg authored
      ...a JavaSparkContext and sqlCtx will refer to a JavaSQLContext
      
      Author: Yadid Ayzenberg <yadid@media.mit.edu>
      
      Closes #932 from yadid/master and squashes the following commits:
      
      f92fb3a [Yadid Ayzenberg] updated java code blocks in spark SQL guide such that ctx will refer to a JavaSparkContext and sqlCtx will refer to a JavaSQLContext
      366c0c4c
    • Uri Laserson's avatar
      SPARK-1917: fix PySpark import of scipy.special functions · 5e98967b
      Uri Laserson authored
      https://issues.apache.org/jira/browse/SPARK-1917
      
      Author: Uri Laserson <laserson@cloudera.com>
      
      Closes #866 from laserson/SPARK-1917 and squashes the following commits:
      
      d947e8c [Uri Laserson] Added test for scipy.special importing
      1798bbd [Uri Laserson] SPARK-1917: fix PySpark import of scipy.special
      5e98967b
    • witgo's avatar
      Improve maven plugin configuration · d8c005d5
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #786 from witgo/maven_plugin and squashes the following commits:
      
      5de86a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into maven_plugin
      c35ef73 [witgo] Improve maven plugin configuration
      d8c005d5
    • Aaron Davidson's avatar
      SPARK-1839: PySpark RDD#take() shouldn't always read from driver · 9909efc1
      Aaron Davidson authored
      This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)
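The incremental scan can be sketched over plain Python lists standing in for partitions. This is an illustration, not the pyspark implementation: the growth heuristic (doubling when nothing was found, otherwise over-estimating by 1.5x) is an assumption, and no jobs are launched.

```python
def take(partitions, num):
    """Collect the first `num` elements, scanning partitions in growing batches."""
    taken = []
    scanned = 0
    total = len(partitions)
    while len(taken) < num and scanned < total:
        if scanned == 0:
            to_scan = 1  # start with the first partition only
        elif not taken:
            to_scan = scanned * 2  # nothing found yet: widen the scan
        else:
            # Estimate how many more partitions are needed from what has
            # been seen so far (assumed heuristic, over-estimating by 50%).
            to_scan = max(1, int(1.5 * num * scanned / len(taken)) - scanned)
        batch = partitions[scanned:scanned + to_scan]
        for part in batch:
            remaining = num - len(taken)
            if remaining <= 0:
                break
            taken.extend(part[:remaining])
        scanned += len(batch)
    return taken
```

In the real system each batch after the first would correspond to a job over the selected partitions; here a batch is simply a list slice.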
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #922 from aarondav/take and squashes the following commits:
      
      fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
      9909efc1
    • Aaron Davidson's avatar
      Super minor: Close inputStream in SparkSubmitArguments · 7d52777e
      Aaron Davidson authored
      `Properties#load()` doesn't close the InputStream, but it'd be closed after being GC'd anyway...
      
      Also changed file.getName to file, because getName only shows the filename. This will show the full (possibly relative) path, which is less confusing if it's not found.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #914 from aarondav/tiny and squashes the following commits:
      
      db9d072 [Aaron Davidson] Super minor: Close inputStream in SparkSubmitArguments
      7d52777e
    • Michael Armbrust's avatar
      [SQL] SPARK-1964 Add timestamp to hive metastore type parser. · 1a0da0ec
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #913 from marmbrus/timestampMetastore and squashes the following commits:
      
      8e0154f [Michael Armbrust] Add timestamp to hive metastore type parser.
      1a0da0ec