  1. Jun 21, 2014
  2. Jun 20, 2014
    • Fix some tests. · 648553d4
      Marcelo Vanzin authored
      - JavaAPISuite was trying to compare a bare path with a URI. Fix by
        extracting the path from the URI, since we know it should be a
        local path anyway.
      
      - b9be1609 excluded the ASM dependency everywhere, but easymock needs
        it (because cglib needs it). So re-add the dependency, with test
        scope this time.
      
      The second one above actually uncovered a weird situation: the maven
      test target works, even though I can't find the class sbt complains
      about in its classpath. sbt complains with:
      
        [error] Uncaught exception when running org.apache.spark.util
        .random.RandomSamplerSuite: java.lang.NoClassDefFoundError:
        org/objectweb/asm/Type
      
      To avoid more weirdness caused by that, I explicitly added the asm
      dependency to both maven and sbt (for tests only), and verified
      the classes don't end up in the final assembly.
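
The first fix above (comparing a bare path with a URI) can be illustrated with a minimal Python sketch. It is not the JavaAPISuite code; `local_path` is a hypothetical helper, and it assumes POSIX-style local paths.

```python
from urllib.parse import urlparse


def local_path(path_or_uri: str) -> str:
    """Return the filesystem path for either a bare path or a file: URI.

    Hypothetical helper for illustration; assumes POSIX-style local paths.
    """
    parsed = urlparse(path_or_uri)
    # A bare path has no scheme, so urlparse leaves it unchanged in .path.
    return parsed.path


# Both forms normalize to the same value, so they can be compared safely.
assert local_path("file:/tmp/spark/out.txt") == local_path("/tmp/spark/out.txt")
```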
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #917 from vanzin/flaky-tests and squashes the following commits:
      
      d022320 [Marcelo Vanzin] Fix some tests.
  3. Jun 17, 2014
    • [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL · d2f4f30b
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2060
      
      Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
      
      Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #999 from yhuai/newJson and squashes the following commits:
      
      227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ce8eedd [Yin Huai] rxin's comments.
      bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      94ffdaa [Yin Huai] Remove "get" from method names.
      ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      79ea9ba [Yin Huai] Fix typos.
      5428451 [Yin Huai] Newline
      1f908ce [Yin Huai] Remove extra line.
      d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      7ea750e [Yin Huai] marmbrus's comments.
      6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      83013fb [Yin Huai] Update Java Example.
      e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
      6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4fbddf0 [Yin Huai] Programming guide.
      9df8c5a [Yin Huai] Python API.
      7027634 [Yin Huai] Java API.
      cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
      d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ab810b0 [Yin Huai] Make JsonRDD private.
      6df0891 [Yin Huai] Apache header.
      8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
      8ffed79 [Yin Huai] Update the example.
      a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
      65b87f0 [Yin Huai] Fix sampling...
      8846af5 [Yin Huai] API doc.
      52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      0387523 [Yin Huai] Address PR comments.
      666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      a2313a6 [Yin Huai] Address PR comments.
      f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
      0576406 [Yin Huai] Add Apache license header.
      af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
      f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
  4. Jun 12, 2014
    • [Minor] Fix style, formatting and naming in BlockManager etc. · 44daec5a
      Andrew Or authored
      This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated.
      
      (Warning: this PR is full of unimportant nitpicks)
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1058 from andrewor14/bm-minor and squashes the following commits:
      
      8e12eaf [Andrew Or] SparkException -> BlockException
      c36fd53 [Andrew Or] Make parts of BlockManager more readable
      0a5f378 [Andrew Or] Entry -> MemoryEntry
      e9762a5 [Andrew Or] Tone down string interpolation (minor reverts)
      c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor
      b3470f1 [Andrew Or] More string interpolation (minor)
      7f9dcab [Andrew Or] Use string interpolation (minor)
      94a425b [Andrew Or] Refactor against duplicate code + minor changes
      8a6a7dc [Andrew Or] Exception -> SparkException
      97c410f [Andrew Or] Deal with MIMA excludes
      2480f1d [Andrew Or] Fixes in StorgeLevel.scala
      abb0163 [Andrew Or] Style, formatting and naming fixes
    • SPARK-1939 Refactor takeSample method in RDD to use ScaSRS · 1de1d703
      Doris Xin authored
      Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.
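
As a rough illustration of the sampling-rate idea, the Python sketch below oversamples so that the sample reaches the requested size with probability of about 0.9999. It is not the code from this PR: `compute_fraction` is a hypothetical name, and the constants (`delta = 1e-4`, the 9 and 5 standard-deviation multipliers) are assumptions modeled on ScaSRS-style bounds.

```python
import math


def compute_fraction(sample_size: int, total: int, with_replacement: bool) -> float:
    """Return a sampling rate slightly above sample_size / total.

    Hypothetical sketch; the constants are assumptions, not taken from the PR.
    """
    if total <= 0 or sample_size < 0:
        raise ValueError("total must be positive and sample_size non-negative")
    fraction = sample_size / total
    if with_replacement:
        # Poisson sampling: pad the rate by a few standard deviations.
        num_std_dev = 9 if sample_size < 12 else 5
        return max(1e-10, fraction + num_std_dev * math.sqrt(fraction / total))
    # Bernoulli sampling: pad using a failure probability delta = 1e-4.
    delta = 1e-4
    gamma = -math.log(delta) / total
    padded = fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)
    return min(1.0, max(1e-10, padded))
```

For example, asking for 100 items out of 10,000 without replacement gives a rate somewhat above 0.01, so a single pass is very likely to yield at least 100 items, which can then be trimmed to the exact size.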
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      Author: dorx <doris.s.xin@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #916 from dorx/takeSample and squashes the following commits:
      
      5b061ae [Doris Xin] merge master
      444e750 [Doris Xin] edge cases
      3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
      82dde31 [Xiangrui Meng] update pyspark's takeSample
      48d954d [Doris Xin] remove unused imports from RDDSuite
      fb1452f [Doris Xin] allowing num to be greater than count in all cases
      1481b01 [Doris Xin] washing test tubes and making coffee
      dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
      64e445b [Doris Xin] logwarnning as soon as it enters the while loop
      55518ed [Doris Xin] added TODO for logging in rdd.py
      eff89e2 [Doris Xin] addressed reviewer comments.
      ecab508 [Doris Xin] "fixed checkstyle violation
      0a9b3e3 [Doris Xin] "reviewer comment addressed"
      f80f270 [Doris Xin] Merge branch 'master' into takeSample
      ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
      065ebcd [Doris Xin] Merge branch 'master' into takeSample
      9bdd36e [Doris Xin] Check sample size and move computeFraction
      e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
      7cab53a [Doris Xin] fixed import bug in rdd.py
      ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
      1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
    • SPARK-1843: Replace assemble-deps with env variable. · 1c04652c
      Patrick Wendell authored
      (This change is actually small, I moved some logic into
      compute-classpath that was previously in spark-class).
      
      Assemble deps has existed for a while to allow developers to
      run local code with new changes quickly. When I'm developing I
      typically use a simpler approach which just prepends the Spark
      classes to the classpath before the assembly jar. This is well
      defined in the JVM and the Spark classes take precedence over those
      in the assembly.
      
      This approach is portable across both builds which is the main reason I'd
      like to switch to it. It's also a bit easier to toggle on and off quickly.
      
      The way you use this is the following:
      ```
      $ ./bin/spark-shell # Use spark with the normal assembly
      $ export SPARK_PREPEND_CLASSES=true
      $ ./bin/spark-shell # Now it's using compiled classes
      $ unset SPARK_PREPEND_CLASSES
      $ ./bin/spark-shell # Back to normal
      ```
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #877 from pwendell/assemble-deps and squashes the following commits:
      
      8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps
      faa3168 [Patrick Wendell] Adding a warning for compatibility
      3f151a7 [Patrick Wendell] Small fix
      bbfb73c [Patrick Wendell] Review feedback
      328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.
    • SPARK-554. Add aggregateByKey. · ce92a9c1
      Sandy Ryza authored
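
The semantics of an aggregateByKey-style operation can be sketched in plain Python, without Spark. `aggregate_by_key` below is a hypothetical stand-in that models partitions as lists of (key, value) pairs: `seq_op` folds values into a per-key accumulator within a partition, and `comb_op` merges accumulators across partitions.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Simulate aggregateByKey over a list of partitions (illustration only).

    zero_value should be immutable, since it is shared across keys here.
    """
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # Fold each value into the accumulator for its key.
            acc[key] = seq_op(acc.get(key, zero_value), value)
        per_partition.append(acc)

    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            # Merge per-partition accumulators for the same key.
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Usage: compute (sum, count) per key across two partitions.
parts = [[("a", 1), ("b", 2)], [("a", 3)]]
result = aggregate_by_key(
    parts,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
```

The accumulator type (a tuple here) can differ from the value type (an int), which is what distinguishes this from a plain reduce by key.
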
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #705 from sryza/sandy-spark-554 and squashes the following commits:
      
      2302b8f [Sandy Ryza] Add MIMA exclude
      f52e0ad [Sandy Ryza] Fix Python tests for real
      2f3afa3 [Sandy Ryza] Fix Python test
      0b735e9 [Sandy Ryza] Fix line lengths
      ae56746 [Sandy Ryza] Fix doc (replace T with V)
      c2be415 [Sandy Ryza] Java and Python aggregateByKey
      23bf400 [Sandy Ryza] SPARK-554.  Add aggregateByKey.
  5. Jun 11, 2014
    • [SPARK-1672][MLLIB] Separate user and product partitioning in ALS · d9203350
      Tor Myklebust authored
      Some clean up work following #593.
      
      1. Allow setting different numbers of user blocks and product blocks in `ALS`.
      2. Update `MovieLensALS` to reflect the change.
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1014 from mengxr/SPARK-1672 and squashes the following commits:
      
      0e910dd [Xiangrui Meng] change private[this] to private[recommendation]
      36420c7 [Xiangrui Meng] set exclusion rules for ALS
      9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      9bab77b [Xiangrui Meng] clean up add numUserBlocks and numProductBlocks to MovieLensALS
      84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672
      d17a8bf [Xiangrui Meng] merge master
      a4925fd [Tor Myklebust] Style.
      bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar
      021f54b [Tor Myklebust] Separate user and product blocks.
      dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
      23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
      495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
      674933a [Tor Myklebust] Fix style.
      40edc23 [Tor Myklebust] Fix missing space.
      f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
      5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
      36a0f43 [Tor Myklebust] Make the partitioner private.
      d872b09 [Tor Myklebust] Add negative id ALS test.
      df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
      c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
      c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
    • [SPARK-2069] MIMA false positives · 5b754b45
      Prashant Sharma authored
      Fixes SPARK-2070 and SPARK-2071.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1021 from ScrapCodes/SPARK-2070/package-private-methods and squashes the following commits:
      
      7979a57 [Prashant Sharma] addressed code review comments
      558546d [Prashant Sharma] A little fancy error message.
      59275ab [Prashant Sharma] SPARK-2071 Mima ignores classes and its members from previous versions too.
      0c4ff2b [Prashant Sharma] SPARK-2070 Ignore methods along with annotated classes.
  6. Jun 06, 2014
    • [SPARK-1841]: update scalatest to version 2.1.5 · 41c4a331
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #713 from witgo/scalatest and squashes the following commits:
      
      b627a6a [witgo] merge master
      51fb3d6 [witgo] merge master
      3771474 [witgo] fix RDDSuite
      996d6f9 [witgo] fix TimeStampedWeakValueHashMap test
      9dfa4e7 [witgo] merge bug
      1479b22 [witgo] merge master
      29b9194 [witgo] fix code style
      022a7a2 [witgo] fix test dependency
      a52c0fa [witgo] fix test dependency
      cd8f59d [witgo] Merge branch 'master' of https://github.com/apache/spark into scalatest
      046540d [witgo] fix RDDSuite.scala
      2c543b9 [witgo] fix ReplSuite.scala
      c458928 [witgo] update scalatest to version 2.1.5
  7. Jun 05, 2014
    • Remove compile-scoped junit dependency. · 668cb1de
      Marcelo Vanzin authored
      This avoids having junit classes showing up in the assembly jar.
      I verified that only test classes in the jtransforms package
      use junit.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #794 from vanzin/junit-dep-exclusion and squashes the following commits:
      
      274e1c2 [Marcelo Vanzin] Remove junit from assembly in sbt build also.
      ad950be [Marcelo Vanzin] Remove compile-scoped junit dependency.
    • sbt 0.13.X should be using sbt-assembly 0.11.X · 5473aa7c
      Kalpit Shah authored
      https://github.com/sbt/sbt-assembly/blob/master/README.md
      
      Author: Kalpit Shah <shahkalpit84@gmail.com>
      
      Closes #555 from kalpit/upgrade/sbtassembly and squashes the following commits:
      
      1fa7324 [Kalpit Shah] sbt 0.13.X should be using sbt-assembly 0.11.X
  8. Jun 04, 2014
    • [SPARK-1817] RDD.zip() should verify partition sizes for each partition · c402a4a6
      Kan Zhang authored
      RDD.zip() will throw an exception if it finds partition sizes are not the same.
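
The check described above can be sketched per partition in plain Python. This is an illustration, not the Spark implementation: `zip_partition` is a hypothetical name, and `ValueError` stands in for whatever exception Spark actually raises.

```python
def zip_partition(left, right):
    """Yield pairs from two iterables, failing if their lengths differ."""
    left_it, right_it = iter(left), iter(right)
    sentinel = object()  # marks exhaustion without clashing with real values
    while True:
        a = next(left_it, sentinel)
        b = next(right_it, sentinel)
        if a is sentinel and b is sentinel:
            return  # both sides ended together: sizes match
        if a is sentinel or b is sentinel:
            raise ValueError("Can only zip partitions with the same number of elements")
        yield (a, b)
```

Unlike Python's built-in `zip`, which silently truncates to the shorter input, this version surfaces the mismatch as soon as one side runs out.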
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #944 from kanzhang/SPARK-1817 and squashes the following commits:
      
      c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates
      524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition
  9. Jun 03, 2014
    • SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. · 1faef149
      Reynold Xin authored
      I also corrected some errors in the previous HLL approximate count API, including the fact that relativeSD wasn't really a measure of error (even though we used it to test error bounds in test results).
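
For background on how a HyperLogLog precision relates to expected error, the Python sketch below uses the textbook estimate that the relative standard error is about 1.04 / sqrt(m) for m = 2^p registers. The constant 1.04 and the minimum precision of 4 are assumptions from the general HLL literature, not values confirmed by this commit, and both function names are hypothetical.

```python
import math


def hll_relative_error(p: int) -> float:
    """Approximate relative standard error for 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)


def precision_for(relative_sd: float) -> int:
    """Smallest precision p whose expected error is at most relative_sd."""
    if relative_sd <= 0:
        raise ValueError("relative_sd must be positive")
    # Solve 1.04 / sqrt(2**p) <= relative_sd for p and round up.
    p = math.ceil(2 * math.log2(1.04 / relative_sd))
    return max(4, p)  # assumed lower bound on precision
```

Under these assumptions, a 5% target needs p = 9 (512 registers), since p = 8 would give an expected error of about 6.5%.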
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #897 from rxin/hll and squashes the following commits:
      
      4d83f41 [Reynold Xin] New error bound and non-randomness.
      f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
      e367527 [Reynold Xin] One more round of code review.
      41e649a [Reynold Xin] Update final mima list.
      9e320c8 [Reynold Xin] Incorporate code review feedback.
      e110d70 [Reynold Xin] Merge branch 'master' into hll
      354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
      acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
      6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
      1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
      9221b27 [Reynold Xin] Merge branch 'master' into hll
      88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
      1294be6 [Reynold Xin] Updated HLL+.
      e7786cb [Reynold Xin] Merge branch 'master' into hll
      c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
    • Synthetic GraphX Benchmark · 894ecde0
      Joseph E. Gonzalez authored
      This PR accomplishes two things:
      
      1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile the GraphX system on arbitrary clusters without access to large graph datasets.
      
      2. This PR improves the implementation of the log-normal graph generator.
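
To make "log-normal graph" concrete, here is a minimal Python sketch that draws each vertex's out-degree from a log-normal distribution and picks distinct destinations at random. It is not the GraphX generator: the function name, the default `mu` and `sigma`, and the seeding scheme are all assumptions chosen for illustration.

```python
import random


def log_normal_graph(num_vertices, mu=4.0, sigma=1.3, seed=0):
    """Return a list of (src, dst) edges with log-normal out-degrees."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    edges = []
    for src in range(num_vertices):
        # Cap the sampled degree at the number of available destinations.
        out_degree = min(num_vertices, int(rng.lognormvariate(mu, sigma)))
        for dst in rng.sample(range(num_vertices), out_degree):
            edges.append((src, dst))
    return edges
```

Because the generator needs only a vertex count and two distribution parameters, it can produce graphs of any size on demand, which is what makes it usable for profiling without real datasets.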
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:
      
      e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
      bccccad [Ankur Dave] Fix long lines
      374678a [Ankur Dave] Bugfix and style changes
      1bdf39a [Joseph E. Gonzalez] updating options
      d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
      f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
    • Add support for Pivotal HD in the Maven build: SPARK-1992 · b1f28535
      tzolov authored
      Allow Spark to build against particular Pivotal HD distributions. For example, to build Spark against Pivotal HD 2.0.1, one can run:
      ```
      mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package
      ```
      
      Author: tzolov <christian.tzolov@gmail.com>
      
      Closes #942 from tzolov/master and squashes the following commits:
      
      bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992]
  10. Jun 01, 2014
    • Better explanation for how to use MIMA excludes. · d17d2214
      Patrick Wendell authored
      This patch does a few things:
      1. We have a file MimaExcludes.scala exclusively for excludes.
      2. The test runner tells users about that file if a test fails.
      3. I've added back the excludes used from 0.9->1.0. We should keep
         these in the project as an official audit trail of times where
         we decided to make exceptions.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #937 from pwendell/mima and squashes the following commits:
      
      7ee0db2 [Patrick Wendell] Better explanation for how to use MIMA excludes.
  11. May 31, 2014
    • Optionally include Hive as a dependency of the REPL. · 7463cd24
      Michael Armbrust authored
      Due to the way spark-shell launches from an assembly jar, I don't think this change will affect anyone who isn't trying to launch the shell directly from sbt.  That said, it is kinda nice to be able to launch all things directly from SBT when developing.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #801 from marmbrus/hiveRepl and squashes the following commits:
      
      9570571 [Michael Armbrust] Optionally include Hive as a dependency of the REPL.
  12. May 30, 2014
    • [SPARK-1971] Update MIMA to compare against Spark 1.0.0 · 79fa8fd4
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #910 from ScrapCodes/enable-mima/spark-core and squashes the following commits:
      
      79f3687 [Prashant Sharma] updated Mima to check against version 1.0
      1e8969c [Prashant Sharma] Spark core missed out on Mima settings. So in effect we never tested spark core for mima related errors.
  13. May 29, 2014
  14. May 19, 2014
  15. May 15, 2014
    • fix different versions of commons-lang dependency and apache/spark#746 addendum · bae07e36
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #754 from witgo/commons-lang and squashes the following commits:
      
      3ebab31 [witgo] merge master
      f3b8fa2 [witgo] merge master
      2083fae [witgo] repeat definition
      5599cdb [witgo] multiple version of sbt  dependency
      c1b66a1 [witgo] fix different versions of commons-lang dependency
  16. May 14, 2014
  17. May 12, 2014
    • SPARK-1786: Reopening PR 724 · 0e2bde20
      Ankur Dave authored
      Addressing issue in MimaBuild.scala.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #742 from jegonzal/edge_partition_serialization and squashes the following commits:
      
      8ba6e0d [Ankur Dave] Add concatenation operators to MimaBuild.scala
      cb2ed3a [Joseph E. Gonzalez] addressing missing exclusion in MimaBuild.scala
      5d27824 [Ankur Dave] Disable reference tracking to fix serialization test
      c0a9ae5 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
      a4a3faa [Joseph E. Gonzalez] Making EdgePartition serializable.
    • SPARK-1806: Upgrade Mesos dependency to 0.18.1 · d9c97ba3
      Bernardo Gomez Palacio authored
      Enabled Mesos (0.18.1) dependency with shaded protobuf
      
      Why is this needed?
      Avoids any protobuf version collision between Mesos and any other
      dependency in Spark e.g. Hadoop HDFS 2.2+ or 1.0.4.
      
      Ticket: https://issues.apache.org/jira/browse/SPARK-1806
      
      * Should close https://issues.apache.org/jira/browse/SPARK-1433
      
      Author berngp
      
      Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
      
      Closes #741 from berngp/feature/SPARK-1806 and squashes the following commits:
      
      5d70646 [Bernardo Gomez Palacio] SPARK-1806: Upgrade Mesos dependency to 0.18.1
  18. May 10, 2014
    • Enabled incremental build that comes with sbt 0.13.2 · 70bcdef4
      Prashant Sharma authored
      More info at https://github.com/sbt/sbt/issues/1010
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #525 from ScrapCodes/sbt-inc-opt and squashes the following commits:
      
      ba8fa42 [Prashant Sharma] Enabled incremental build that comes with sbt 0.13.2
    • SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure · 2b7bd29e
      Sean Owen authored
      TL;DR: there is a bit of JAR hell with Netty that can be mostly resolved, and resolving it fixes a test failure.
      
      I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamSuite, and have for a short while (is it just me?)
      
      velvia notes:
      "I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."
      
      There are at least 3 versions of Netty in play in the build:
      
      - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
      - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
      - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
      
      The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.
      
      The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.
      
      But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.
      
      If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.
      
      So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:
      
      - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
      - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
      - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
      - Update SBT build accordingly
      
      A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.
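
The situation described above (several Netty versions on one classpath) can be spotted with a small Python sketch that groups `group:artifact:version` coordinates and reports artifacts present in more than one version. `conflicting_versions` is a hypothetical helper; it operates on a plain list of strings and does not call Maven or sbt.

```python
from collections import defaultdict


def conflicting_versions(coordinates):
    """Map (group, artifact) to its versions when more than one is present."""
    versions = defaultdict(set)
    for coord in coordinates:
        group, artifact, version = coord.split(":")
        versions[(group, artifact)].add(version)
    return {ga: sorted(vs) for ga, vs in versions.items() if len(vs) > 1}


# The three Netty coordinates named in the description above.
deps = [
    "io.netty:netty:3.4.0.Final",
    "io.netty:netty:3.6.6.Final",
    "io.netty:netty-all:4.0.17.Final",
]
conflicts = conflicting_versions(deps)
```

Here only `io.netty:netty` is reported, with versions 3.4.0.Final and 3.6.6.Final. `io.netty:netty-all` has a different artifact ID, so a coordinate-based check does not see it as competing with `io.netty:netty`, which is why excludes had to be added by hand.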
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #723 from srowen/SPARK-1789 and squashes the following commits:
      
      43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
    • Unify GraphImpl RDDs + other graph load optimizations · 905173df
      Ankur Dave authored
      This PR makes the following changes, primarily in e4fbd329aef85fe2c38b0167255d2a712893d683:
      
      1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices).
      
      2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former.
      
      3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side effect of unifying the edges and the triplet view.
      
      4. *Join elimination for mapTriplets.*
      
      5. *Ship only the needed vertex attributes when upgrading the triplet view.* If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #497 from ankurdave/unify-rdds and squashes the following commits:
      
      332ab43 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
      4933e2e [Ankur Dave] Exclude RoutingTable from binary compatibility check
      5ba8789 [Ankur Dave] Add GraphX upgrade guide from Spark 0.9.1
      13ac845 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
      a04765c [Ankur Dave] Remove unnecessary toOps call
      57202e8 [Ankur Dave] Replace case with pair parameter
      75af062 [Ankur Dave] Add explicit return types
      04d3ae5 [Ankur Dave] Convert implicit parameter to context bound
      c88b269 [Ankur Dave] Revert upgradeIterator to if-in-a-loop
      0d3584c [Ankur Dave] EdgePartition.size should be val
      2a928b2 [Ankur Dave] Set locality wait
      10b3596 [Ankur Dave] Clean up public API
      ae36110 [Ankur Dave] Fix style errors
      e4fbd32 [Ankur Dave] Unify GraphImpl RDDs + other graph load optimizations
      d6d60e2 [Ankur Dave] In GraphLoader, coalesce to minEdgePartitions
      62c7b78 [Ankur Dave] In Analytics, take PageRank numIter
      d64e8d4 [Ankur Dave] Log current Pregel iteration
    • [SQL] Upgrade parquet library. · 4d605532
      Michael Armbrust authored
      I think we are hitting this issue in some perf tests: https://github.com/Parquet/parquet-mr/commit/6aed5288fd4a1398063a5a219b2ae4a9f71b02cf
      
      Credit to @aarondav !
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #684 from marmbrus/upgradeParquet and squashes the following commits:
      
      e10a619 [Michael Armbrust] Upgrade parquet library.
    • [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar · 56151086
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #688 from witgo/SPARK-1644 and squashes the following commits:
      
      56ad6ac [witgo] review commit
      87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644
      6ffa7e4 [witgo] review commit
      a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
  19. May 07, 2014
    • [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations... · 967635a2
      Kan Zhang authored
      ... that do not change schema
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:
      
      111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
      91dc787 [Kan Zhang] Taking into account newly added Ordering param
      79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
  20. May 06, 2014
    • [SPARK-1549] Add Python support to spark-submit · 951a5d93
      Matei Zaharia authored
      This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
      
      This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
      
      In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
      
      In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
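
The Main-Class fallback mentioned above (SPARK-1709) can be sketched in Python using only the standard library. This is not the spark-submit code: `main_class_from_jar` is a hypothetical helper, `com.example.Main` is a made-up class name, and the parser ignores manifest continuation lines for brevity.

```python
import io
import zipfile


def main_class_from_jar(jar_file, user_main_class=None):
    """Return the user's main class, else the JAR manifest's Main-Class."""
    if user_main_class:
        return user_main_class  # an explicit choice always wins
    with zipfile.ZipFile(jar_file) as jar:
        try:
            manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            return None  # JAR has no manifest
    for line in manifest.splitlines():
        name, sep, value = line.partition(":")
        # Manifest attribute names are case-insensitive.
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None


def demo_jar():
    """Build a tiny in-memory JAR with a Main-Class entry (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr(
            "META-INF/MANIFEST.MF",
            "Manifest-Version: 1.0\r\nMain-Class: com.example.Main\r\n",
        )
    buf.seek(0)
    return buf
```

Calling `main_class_from_jar(demo_jar())` returns the manifest value, while passing a second argument overrides it, mirroring the "if not specified by the user" rule.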
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #664 from mateiz/py-submit and squashes the following commits:
      
      15e9669 [Matei Zaharia] Fix some uses of path.separator property
      051278c [Matei Zaharia] Small style fixes
      0afe886 [Matei Zaharia] Add license headers
      4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
      15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
      47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
      d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
      951a5d93
    • Thomas Graves's avatar
      SPARK-1474: Spark on yarn assembly doesn't include AmIpFilter · 1e829905
      Thomas Graves authored
      We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in Spark on YARN but do not include it in the assembly jar.
      
      I tested this on a YARN cluster by removing the YARN jars from the classpath, and Spark now runs fine.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #406 from tgravescs/SPARK-1474 and squashes the following commits:
      
      1548bf9 [Thomas Graves] SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
      1e829905
  21. May 05, 2014
    • Sean Owen's avatar
      SPARK-1556. jets3t dep doesn't update properly with newer Hadoop versions · 73b0cbcc
      Sean Owen authored
      See related discussion at https://github.com/apache/spark/pull/468
      
      This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.
      
      - Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
      - Removes `hadoop.major.version`
      - Introduces `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
        - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
        - like the jets3t version issue now
      - Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden
      - _(YARN profiles in the parent now only exist to add the sub-module)_
      - Fixes the jets3t dependency issue
        - and makes it a runtime dependency
        - and centralizes config of this guy in the parent pom
      - Updates build docs
      - Updates SBT build too
        - and fixes a regex problem along the way
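
      The jets3t rule from the list above (the squashed commits mention jets3t 0.9.0 for Hadoop 2.3+) can be sketched as a version check. The function name, the regex, and the fallback version for older Hadoop releases are assumptions for illustration; they are not copied from pom.xml or the SBT build.

```python
import re

# Matches Hadoop 2.3 and later 2.x releases (2.3.0, 2.4.1, 2.10.0, ...).
# Illustrative pattern only; the regex used in the SBT build is not shown here.
_HADOOP_2_3_PLUS = re.compile(r"^2\.(?:[3-9]|\d{2,})(?:\.|$)")

JETS3T_NEW = "0.9.0"  # from the commit: "Use jets3t 0.9.0 for Hadoop 2.3+"
JETS3T_OLD = "0.7.1"  # assumed value for older Hadoop versions


def jets3t_version(hadoop_version):
    """Pick the jets3t version that goes with the given Hadoop version string."""
    if _HADOOP_2_3_PLUS.match(hadoop_version):
        return JETS3T_NEW
    return JETS3T_OLD
```

      Anchoring the pattern and requiring a dot or end of string after the minor version keeps strings such as `2.2.0` or `0.23.10` from being treated as 2.3+.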
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #629 from srowen/SPARK-1556 and squashes the following commits:
      
      c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc
      a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles
      274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config
      bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build)
      f21f356 [Sean Owen] Build changes to set up for jets3t fix
      73b0cbcc
  22. May 04, 2014
    • Sean Owen's avatar
      SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as... · f5041579
      Sean Owen authored
      SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
      
      For consideration. This was proposed in related discussion: https://github.com/apache/spark/pull/569
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #635 from srowen/SPARK-1629.2 and squashes the following commits:
      
      a442b98 [Sean Owen] Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
      f5041579
  23. Apr 29, 2014
    • Xiangrui Meng's avatar
      [SPARK-1636][MLLIB] Move main methods to examples · 3f38334f
      Xiangrui Meng authored
      * `NaiveBayes` -> `SparseNaiveBayes`
      * `KMeans` -> `DenseKMeans`
      * `SVMWithSGD` and `LogisticRegressionWithSGD` -> `BinaryClassification`
      * `ALS` -> `MovieLensALS`
      * `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression`
      * `DecisionTree` -> `DecisionTreeRunner`
      
      `scopt` is used for parsing command-line parameters. `scopt` has MIT license and it only depends on `scala-library`.
      
      Example help message:
      
      ~~~
      BinaryClassification: an example app for binary classification.
      Usage: BinaryClassification [options] <input>
      
        --numIterations <value>
              number of iterations
        --stepSize <value>
              initial step size, default: 1.0
        --algorithm <value>
              algorithm (SVM,LR), default: LR
        --regType <value>
              regularization type (L1,L2), default: L2
        --regParam <value>
              regularization parameter, default: 0.1
        <input>
              input paths to labeled examples in LIBSVM format
      ~~~
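
      For readers unfamiliar with scopt, the option set in the help message above corresponds to a parser definition along these lines. The sketch uses Python's `argparse` as a stand-in, so it mirrors the options rather than the example's Scala code; the default for `--numIterations` is not shown in the help text and is an assumed value here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the BinaryClassification help message."""
    parser = argparse.ArgumentParser(
        prog="BinaryClassification",
        description="BinaryClassification: an example app for binary classification.",
    )
    # Default of 100 is an assumption; the help message gives no default.
    parser.add_argument("--numIterations", type=int, default=100,
                        help="number of iterations")
    parser.add_argument("--stepSize", type=float, default=1.0,
                        help="initial step size, default: 1.0")
    parser.add_argument("--algorithm", choices=["SVM", "LR"], default="LR",
                        help="algorithm (SVM,LR), default: LR")
    parser.add_argument("--regType", choices=["L1", "L2"], default="L2",
                        help="regularization type (L1,L2), default: L2")
    parser.add_argument("--regParam", type=float, default=0.1,
                        help="regularization parameter, default: 0.1")
    parser.add_argument("input",
                        help="input paths to labeled examples in LIBSVM format")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```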
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #584 from mengxr/mllib-main and squashes the following commits:
      
      7b58c60 [Xiangrui Meng] minor
      6e35d7e [Xiangrui Meng] make imports explicit and fix code style
      c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit
      6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner
      be86069 [Xiangrui Meng] use main instead of extending App
      b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples
      8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params
      fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example
      67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example
      b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example
      b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params
      577945b [Xiangrui Meng] remove unused imports from NB
      3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification
      f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app
      01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main
      9420692 [Xiangrui Meng] add scopt to examples dependencies
      3f38334f
    • witgo's avatar
      Improved build configuration · 030f2c21
      witgo authored
      1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
      2, Fix SPARK-1491: maven hadoop-provided profile fails to build
      3, Fix inconsistent dependency versions for org.scala-lang:* and org.apache.avro:*
      4, Reformat sql/catalyst/pom.xml, sql/hive/pom.xml, and sql/core/pom.xml (four-space indentation changed to two spaces)
      
      Author: witgo <witgo@qq.com>
      
      Closes #480 from witgo/format_pom and squashes the following commits:
      
      03f652f [witgo] review commit
      b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
      7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
      0da4bc3 [witgo] merge master
      d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      e345919 [witgo] add avro dependency to yarn-alpha
      77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
      1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      934f24d [witgo] review commit
      cf46edc [witgo] exclude jruby
      06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
      99464d2 [witgo] fix maven hadoop-provided profile fails to build
      0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
      6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
      030f2c21
  24. Apr 28, 2014
    • Cheng Hao's avatar
      Update the import package name for TestHive in sbt shell · ea01affc
      Cheng Hao authored
      sbt/sbt hive/console will fail as TestHive changed its package from "org.apache.spark.sql.hive" to "org.apache.spark.sql.hive.test".
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #574 from chenghao-intel/hive_console and squashes the following commits:
      
      de14035 [Cheng Hao] Update the import package name for TestHive in sbt shell
      ea01affc
  25. Apr 25, 2014
    • Matei Zaharia's avatar
      SPARK-1621 Upgrade Chill to 0.3.6 · a24d918c
      Matei Zaharia authored
      It registers more Scala classes, including things like Ranges that we had to register manually before. See https://github.com/twitter/chill/releases for Chill's change log.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #543 from mateiz/chill-0.3.6 and squashes the following commits:
      
      a1dc5e0 [Matei Zaharia] Upgrade Chill to 0.3.6 and remove our special registration of Ranges
      a24d918c