  1. Nov 12, 2014
    • [SPARK-4370] [Core] Limit number of Netty cores based on executor size · b9e1c2eb
      Aaron Davidson authored
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3155 from aarondav/conf and squashes the following commits:
      
      7045e77 [Aaron Davidson] Add mesos comment
      4770f6e [Aaron Davidson] [SPARK-4370] [Core] Limit number of Netty cores based on executor size
      b9e1c2eb
    • [SPARK-4373][MLLIB] fix MLlib maven tests · 23f5bdf0
      Xiangrui Meng authored
      We want to make sure there is at most one SparkContext inside the same JVM. JoshRosen
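
      As a rough illustration of the pattern (a hand-written sketch, not the actual MLlibTestSparkContext introduced here; the trait name is hypothetical), a shared-context test trait could look like this:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.scalatest.{BeforeAndAfterAll, Suite}

      // Hypothetical sketch: one SparkContext shared by every test in a suite,
      // created before the suite runs and stopped afterwards, so two contexts
      // are never alive in the same JVM at once.
      trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>
        @transient var sc: SparkContext = _

        override def beforeAll(): Unit = {
          super.beforeAll()
          sc = new SparkContext(new SparkConf().setMaster("local").setAppName(suiteName))
        }

        override def afterAll(): Unit = {
          if (sc != null) sc.stop()
          super.afterAll()
        }
      }
      ```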
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3235 from mengxr/SPARK-4373 and squashes the following commits:
      
      6574b69 [Xiangrui Meng] rename LocalSparkContext to MLlibTestSparkContext
      913d48d [Xiangrui Meng] make sure there is at most one spark context inside the same jvm
      23f5bdf0
    • [Release] Bring audit scripts up-to-date · 723a86b0
      Andrew Or authored
      This involves a few main changes:
      - Log all output messages to the log file. Previously the log file
        was not useful because it did not indicate progress.
      - Remove hive-site.xml in sbt_hive_app to avoid interference
      - Add the appropriate repositories for new dependencies
      723a86b0
    • [SPARK-2672] support compressed file in wholeTextFile · d7d54a44
      Davies Liu authored
      wholeTextFile() cannot read compressed files; it should be able to, just like textFile().
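
      For context, the intended usage after this change would be along these lines (a hand-written sketch; the input path is made up):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("whole-text-gz"))

      // With this change, wholeTextFiles should handle compressed inputs
      // (e.g. .gz files) the same way textFile already does, returning
      // (path, content) pairs with the content decompressed.
      val files = sc.wholeTextFiles("/path/to/dir-with-gz-files")
      files.collect().foreach { case (path, content) =>
        println(s"$path -> ${content.length} chars")
      }
      ```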
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3005 from davies/whole and squashes the following commits:
      
      a43fcfb [Davies Liu] remove semicolon
      c83571a [Davies Liu] remove = if return type is Unit
      83c844f [Davies Liu] Merge branch 'master' of github.com:apache/spark into whole
      22e8b3e [Davies Liu] support compressed file in wholeTextFile
      d7d54a44
    • [SPARK-4369] [MLLib] fix TreeModel.predict() with RDD · bd86118c
      Davies Liu authored
      Fix TreeModel.predict() with RDD and add tests for it.
      
      (Also checked that other models don't have this issue)
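
      For reference, RDD-based prediction with a tree model is used roughly like this (a hand-written Scala sketch with made-up data, not code from the PR; assumes an existing SparkContext `sc`):

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.DecisionTree

      // Tiny made-up training set.
      val data = sc.parallelize(Seq(
        LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
        LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

      // Train a small classifier: 2 classes, no categorical features,
      // gini impurity, maxDepth 3, maxBins 32.
      val model = DecisionTree.trainClassifier(data, 2, Map[Int, Int](), "gini", 3, 32)

      // predict() applied to an RDD of feature vectors rather than a single vector.
      val predictions = model.predict(data.map(_.features))
      predictions.collect().foreach(println)
      ```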
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3230 from davies/predict and squashes the following commits:
      
      81172aa [Davies Liu] fix predict
      bd86118c
    • [SPARK-3666] Extract interfaces for EdgeRDD and VertexRDD · a5ef5811
      Ankur Dave authored
      This discourages users from calling the VertexRDD and EdgeRDD constructors and makes it easier for future changes to preserve backward compatibility.
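
      In practice this means going through the public factory methods rather than the now package-private implementation constructors; a rough sketch (hand-written, assumes an existing SparkContext `sc`):

      ```scala
      import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

      // Build vertex and edge RDDs via the public factories; the *Impl
      // constructors are package-private after this change.
      val vertices = VertexRDD(sc.parallelize(Seq((1L, "alice"), (2L, "bob"))))
      val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
      val graph = Graph(vertices, edges)
      ```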
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #2530 from ankurdave/SPARK-3666 and squashes the following commits:
      
      d681f45 [Ankur Dave] Define getPartitions and compute in abstract class for MIMA
      1472390 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
      24201d4 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
      cbe15f2 [Ankur Dave] Remove specialized annotation from VertexRDD and EdgeRDD
      931b587 [Ankur Dave] Use abstract class instead of trait for binary compatibility
      9ba4ec4 [Ankur Dave] Mark (Vertex|Edge)RDDImpl constructors package-private
      620e603 [Ankur Dave] Extract VertexRDD interface and move implementation to VertexRDDImpl
      55b6398 [Ankur Dave] Extract EdgeRDD interface and move implementation to EdgeRDDImpl
      a5ef5811
    • Andrew Or · c3afd326
    • Internal cleanup for aggregateMessages · 0402be90
      Ankur Dave authored
      1. Add EdgeActiveness enum to represent activeness criteria more cleanly than using booleans.
      2. Comments and whitespace.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #3231 from ankurdave/aggregateMessages-followup and squashes the following commits:
      
      3d485c3 [Ankur Dave] Internal cleanup for aggregateMessages
      0402be90
    • [SPARK-4281][Build] Package Yarn shuffle service into its own jar · aa43a8da
      Andrew Or authored
      This is another addendum to #3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3147 from andrewor14/yarn-shuffle-build and squashes the following commits:
      
      bda58d0 [Andrew Or] Fix line too long
      81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build
      fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar
      65db822 [Andrew Or] Actually mark slf4j as provided
      abcefd1 [Andrew Or] Do the same for SBT
      c653028 [Andrew Or] Package network-yarn and its dependencies
      aa43a8da
    • [Test] Better exception message from SparkSubmitSuite · 6e3c5a29
      Andrew Or authored
      Before:
      ```
      Exception in thread "main" java.lang.Exception: Could not load user defined classes inside of executors
      	at org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:471)
      	at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ```
      After:
      ```
      Exception in thread "main" java.lang.Exception: Could not load user class from jar:
      java.lang.UnsupportedClassVersionError: SparkSubmitClassA : Unsupported major.minor version 51.0
      	java.lang.ClassLoader.defineClass1(Native Method)
      	java.lang.ClassLoader.defineClass(ClassLoader.java:643)
      	...
      	at org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:472)
      	at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ```
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3212 from andrewor14/submit-suite-message and squashes the following commits:
      
      7779248 [Andrew Or] Format exception
      8fe6719 [Andrew Or] Better exception message from failed test
      6e3c5a29
    • [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation · 36ddeb7b
      Soumitra Kumar authored
      SPARK-3660 : Initial RDD for updateStateByKey transformation
      
      I have added a sample StatefulNetworkWordCountWithInitial inspired by StatefulNetworkWordCount.
      
      Please let me know if any changes are required.
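
      A rough usage sketch of the new variant (hand-written, not taken from the PR; assumes an existing StreamingContext `ssc` and a DStream `wordCounts: DStream[(String, Int)]`):

      ```scala
      import org.apache.spark.HashPartitioner
      import org.apache.spark.rdd.RDD

      // Seed state: counts carried over from a previous run (made-up values).
      val initialRDD: RDD[(String, Int)] =
        ssc.sparkContext.parallelize(Seq(("hello", 1), ("world", 1)))

      val updateFunc = (values: Seq[Int], state: Option[Int]) =>
        Some(values.sum + state.getOrElse(0))

      // The new overload: updateStateByKey seeded with an initial RDD.
      val stateDStream = wordCounts.updateStateByKey[Int](
        updateFunc,
        new HashPartitioner(ssc.sparkContext.defaultParallelism),
        initialRDD)
      ```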
      
      Author: Soumitra Kumar <kumar.soumitra@gmail.com>
      
      Closes #2665 from soumitrak/master and squashes the following commits:
      
      ee8980b [Soumitra Kumar] Fixed copy/paste issue.
      304f636 [Soumitra Kumar] Added simpler version of updateStateByKey API with initialRDD and test.
      9781135 [Soumitra Kumar] Fixed test, and renamed variable.
      3da51a2 [Soumitra Kumar] Adding updateStateByKey with initialRDD API to JavaPairDStream.
      2f78f7e [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
      d4fdd18 [Soumitra Kumar] Renamed variable and moved method.
      d0ce2cd [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
      31399a4 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
      4efa58b [Soumitra Kumar] [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
      8f40ca0 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
      dde4271 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
      fdd7db3 [Soumitra Kumar] Adding support of initial value for state update. SPARK-3660 : Initial RDD for updateStateByKey transformation
      36ddeb7b
    • [SPARK-3530][MLLIB] pipeline and parameters with examples · 4b736dba
      Xiangrui Meng authored
      This PR adds the package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is joint work with jkbradley, etrain, shivaram, and many others who helped on the design, with additional help from marmbrus and liancheng on the Spark SQL side. The design doc can be found at:
      
      https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
      
      **org.apache.spark.ml**
      
      This is a new package with a new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:
      
      1. Transformer, which transforms a dataset into another
      2. Estimator, which fits models to data, where models are transformers
      3. Evaluator, which evaluates model output and returns a scalar metric
      4. Pipeline, a simple pipeline that consists of transformers and estimators
      
      Parameters could be supplied at fit/transform or embedded with components.
      
      1. Param: a strongly-typed parameter key with self-contained doc
      2. ParamMap: a param -> value map
      3. Params: trait for components with parameters
      
      For any component that implements `Params`, users can easily check the docs by calling `explainParams`:
      
      ~~~
      > val lr = new LogisticRegression
      > lr.explainParams
      maxIter: max number of iterations (default: 100)
      regParam: regularization constant (default: 0.1)
      labelCol: label column name (default: label)
      featuresCol: features column name (default: features)
      ~~~
      
      or users can check an individual param:
      
      ~~~
      > lr.maxIter
      maxIter: max number of iterations (default: 100)
      ~~~
      
      **Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:**
      
      1. run a simple logistic regression job
      
      ~~~
          val lr = new LogisticRegression()
            .setMaxIter(10)
            .setRegParam(1.0)
          val model = lr.fit(dataset)
          model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
            .select('label, 'score, 'prediction).collect()
            .foreach(println)
      ~~~
      
      2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric
      
      ~~~
          val lr = new LogisticRegression
          val lrParamMaps = new ParamGridBuilder()
            .addGrid(lr.regParam, Array(0.1, 100.0))
            .addGrid(lr.maxIter, Array(0, 5))
            .build()
          val eval = new BinaryClassificationEvaluator
          val cv = new CrossValidator()
            .setEstimator(lr)
            .setEstimatorParamMaps(lrParamMaps)
            .setEvaluator(eval)
            .setNumFolds(3)
          val bestModel = cv.fit(dataset)
      ~~~
      
      3. run a pipeline that consists of a standard scaler and a logistic regression component
      
      ~~~
          val scaler = new StandardScaler()
            .setInputCol("features")
            .setOutputCol("scaledFeatures")
          val lr = new LogisticRegression()
            .setFeaturesCol(scaler.getOutputCol)
          val pipeline = new Pipeline()
            .setStages(Array(scaler, lr))
          val model = pipeline.fit(dataset)
          val predictions = model.transform(dataset)
            .select('label, 'score, 'prediction)
            .collect()
            .foreach(println)
      ~~~
      
      4. a simple text classification pipeline, which recognizes "spark":
      
      ~~~
          val training = sparkContext.parallelize(Seq(
            LabeledDocument(0L, "a b c d e spark", 1.0),
            LabeledDocument(1L, "b d", 0.0),
            LabeledDocument(2L, "spark f g h", 1.0),
            LabeledDocument(3L, "hadoop mapreduce", 0.0)))
          val tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("words")
          val hashingTF = new HashingTF()
            .setInputCol(tokenizer.getOutputCol)
            .setOutputCol("features")
          val lr = new LogisticRegression()
            .setMaxIter(10)
          val pipeline = new Pipeline()
            .setStages(Array(tokenizer, hashingTF, lr))
          val model = pipeline.fit(training)
          val test = sparkContext.parallelize(Seq(
            Document(4L, "spark i j k"),
            Document(5L, "l m"),
            Document(6L, "mapreduce spark"),
            Document(7L, "apache hadoop")))
          model.transform(test)
            .select('id, 'text, 'prediction, 'score)
            .collect()
            .foreach(println)
      ~~~
      
      Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`.
      
      **What are missing now and will be added soon:**
      
      1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~
      2. ~~Java examples.~~
      3. ~~Store training parameters in trained models.~~
      4. (later) Serialization and Python API.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3099 from mengxr/SPARK-3530 and squashes the following commits:
      
      2cc93fd [Xiangrui Meng] hide APIs as much as I can
      34319ba [Xiangrui Meng] use local instead local[2] for unit tests
      2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema
      c9daab4 [Xiangrui Meng] remove mockito version
      1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext
      6ffc389 [Xiangrui Meng] try to fix unit test
      a59d8b7 [Xiangrui Meng] doc updates
      977fd9d [Xiangrui Meng] add scala ml package object
      6d97fe6 [Xiangrui Meng] add AlphaComponent annotation
      731f0e4 [Xiangrui Meng] update package doc
      0435076 [Xiangrui Meng] remove ;this from setters
      fa21d9b [Xiangrui Meng] update extends indentation
      f1091b3 [Xiangrui Meng] typo
      228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics
      f51cd27 [Xiangrui Meng] rename default to defaultValue
      b3be094 [Xiangrui Meng] refactor schema transform in lr
      8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing
      51f1c06 [Xiangrui Meng] remove leftover code in Transformer
      494b632 [Xiangrui Meng] compure score once
      ad678e9 [Xiangrui Meng] more doc for Transformer
      4306ed4 [Xiangrui Meng] org imports in text pipeline
      6e7c1c7 [Xiangrui Meng] update pipeline
      4f9e34f [Xiangrui Meng] more doc for pipeline
      aa5dbd4 [Xiangrui Meng] fix typo
      11be383 [Xiangrui Meng] fix unit tests
      3df7952 [Xiangrui Meng] clean up
      986593e [Xiangrui Meng] re-org java test suites
      2b11211 [Xiangrui Meng] remove external data deps
      9fd4933 [Xiangrui Meng] add unit test for pipeline
      2a0df46 [Xiangrui Meng] update tests
      2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info
      27582a4 [Xiangrui Meng] doc changes
      73a000b [Xiangrui Meng] add schema transformation layer
      6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait
      80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer
      62ca2bb [Xiangrui Meng] check param parent in set/get
      1622349 [Xiangrui Meng] add getModel to PipelineModel
      a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer
      d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap
      c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite
      e246f29 [Xiangrui Meng] re-org:
      7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline
      b95c408 [Xiangrui Meng] remove implicits add unit tests to params
      bab3e5b [Xiangrui Meng] update params
      fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
      6e86d98 [Xiangrui Meng] some code clean-up
      2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip]
      fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform
      3f810cd [Xiangrui Meng] use multi-model training api in cv
      5b8f413 [Xiangrui Meng] rename model to modelParams
      9d2d35d [Xiangrui Meng] test varargs and chain model params
      f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
      1ef26e0 [Xiangrui Meng] specialize methods/types for Java
      df293ed [Xiangrui Meng] switch to setter/getter
      376db0a [Xiangrui Meng] pipeline and parameters
      4b736dba
    • [SPARK-4355][MLLIB] fix OnlineSummarizer.merge when other.mean is zero · 84324fbc
      Xiangrui Meng authored
      See the inline comment about the bug. I also did some code clean-up. dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause a performance regression, but it would be great if you have some time to test.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3220 from mengxr/SPARK-4355 and squashes the following commits:
      
      5ef601f [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero and some code clean-up
      84324fbc
    • [SPARK-3936] Add aggregateMessages, which supersedes mapReduceTriplets · faeb41de
      Ankur Dave authored
      aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:
      
      1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.
      
      2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.
      
      Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:
      
      1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
      
      2. Internal iterators in aggregateMessages are inlined into a while loop.
      
      In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).
      
      Subsumes apache/spark#2815. Also fixes SPARK-4173.
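
      For illustration, a minimal use of the new API (hand-written, assuming an existing `graph: Graph[Double, Int]`) that computes per-vertex in-degree:

      ```scala
      import org.apache.spark.graphx._

      // Each triplet sends the message 1 to its destination vertex; messages are
      // summed per vertex. TripletFields.None declares that no vertex attributes
      // need to be shipped -- the explicit replacement for bytecode inspection.
      val inDegrees: VertexRDD[Int] = graph.aggregateMessages[Int](
        ctx => ctx.sendToDst(1),
        _ + _,
        TripletFields.None)
      ```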
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #3100 from ankurdave/aggregateMessages and squashes the following commits:
      
      f5b65d0 [Ankur Dave] Address @rxin comments on apache/spark#3054 and apache/spark#3100
      1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
      194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
      e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
      c85076d [Ankur Dave] Readability improvements
      b567be2 [Ankur Dave] iter.foreach -> while loop
      4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition
      faeb41de
    • [MLLIB] SPARK-4347: Reducing GradientBoostingSuite run time. · 2ef016b1
      Manish Amde authored
      Before:
      [info] GradientBoostingSuite:
      [info] - Regression with continuous features: SquaredError (22 seconds, 115 milliseconds)
      [info] - Regression with continuous features: Absolute Error (19 seconds, 330 milliseconds)
      [info] - Binary classification with continuous features: Log Loss (19 seconds, 17 milliseconds)
      
      After:
      [info] - Regression with continuous features: SquaredError (7 seconds, 69 milliseconds)
      [info] - Regression with continuous features: Absolute Error (4 seconds, 617 milliseconds)
      [info] - Binary classification with continuous features: Log Loss (4 seconds, 658 milliseconds)
      
      cc: mengxr, jkbradley
      
      Author: Manish Amde <manish9ue@gmail.com>
      
      Closes #3214 from manishamde/gbt_test_speedup and squashes the following commits:
      
      8994552 [Manish Amde] reducing gbt test run times
      2ef016b1
  2. Nov 11, 2014
    • Support cross building for Scala 2.11 · daaca14c
      Prashant Sharma authored
      Let's give this another go using a version of Hive that shades its JLine dependency.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:
      
      e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
      f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
      a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
      7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
      583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
      3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
      935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
      925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
      2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
      8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
      5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
      2121071 [Patrick Wendell] Migrating version detection to PySpark
      b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
      1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
      f5cad4e [Patrick Wendell] Add Scala 2.11 docs
      210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
      48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
      e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
      67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
      8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
      e22b104 [Patrick Wendell] Small fix in pom file
      ec402ab [Patrick Wendell] Various fixes
      0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
      4eaec65 [Prashant Sharma] Changed scripts to ignore target.
      5167bea [Prashant Sharma] small correction
      a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
      80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
      034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
      d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
      6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
      e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
      937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
      cb059b0 [Prashant Sharma] Code review
      0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
      daaca14c
    • Andrew Or · 2ddb1415
    • SPARK-2269 Refactor mesos scheduler resourceOffers and add unit test · a878660d
      Timothy Chen authored
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #1487 from tnachen/resource_offer_refactor and squashes the following commits:
      
      4ea5dec [Timothy Chen] Rebase from master and address comments
      9ccab09 [Timothy Chen] Address review comments
      e6494dc [Timothy Chen] Refactor class loading
      8207428 [Timothy Chen] Refactor mesos scheduler resourceOffers and add unit test
      a878660d
    • [SPARK-4282][YARN] Stopping flag in YarnClientSchedulerBackend should be volatile · 7f371884
      Kousuke Saruta authored
      In YarnClientSchedulerBackend, a variable "stopping" is used as a flag and is accessed by multiple threads, so it should be volatile.
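
      A minimal sketch of the pattern (not the actual YarnClientSchedulerBackend code):

      ```scala
      class BackendLike {
        // @volatile guarantees that a write from the monitoring thread is
        // immediately visible to the thread that checks the flag in stop().
        @volatile private var stopping: Boolean = false

        def stop(): Unit = { stopping = true }
        def isStopping: Boolean = stopping
      }
      ```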
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3143 from sarutak/stopping-flag-volatile and squashes the following commits:
      
      58fdcc9 [Kousuke Saruta] Marked stoppig flag as volatile
      7f371884
    • SPARK-4305 [BUILD] yarn-alpha profile won't build due to network/yarn module · f820b563
      Sean Owen authored
      SPARK-3797 introduced the `network/yarn` module, but its YARN code depends on YARN APIs not present in older versions covered by the `yarn-alpha` profile. As a result builds like `mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package` fail.
      
      The solution is just to not build `network/yarn` with profile `yarn-alpha`.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3167 from srowen/SPARK-4305 and squashes the following commits:
      
      88938cb [Sean Owen] Don't build network/yarn in yarn-alpha profile as it won't compile
      f820b563
    • SPARK-1830 Deploy failover, Make Persistence engine and LeaderAgent Pluggable · deefd9d7
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #771 from ScrapCodes/deploy-failover-pluggable and squashes the following commits:
      
      29ba440 [Prashant Sharma] fixed a compilation error
      fef35ec [Prashant Sharma] Code review
      57ee6f0 [Prashant Sharma] SPARK-1830 Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
      deefd9d7
    • [Streaming][Minor]Replace some 'if-else' in Clock · 6e03de30
      huangzhaowei authored
      Replace some 'if-else' statements with math.min and math.max in Clock.scala.
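
      The kind of simplification being made, as a sketch (illustrative, not the actual Clock code):

      ```scala
      // Before: branch explicitly to clamp a wait target to "no earlier than now".
      def nextTime(now: Long, target: Long): Long =
        if (now < target) target else now

      // After: the same logic expressed with math.max.
      def nextTimeSimplified(now: Long, target: Long): Long =
        math.max(now, target)
      ```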
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #3088 from SaintBacchus/StreamingClock and squashes the following commits:
      
      7b7f8e7 [huangzhaowei] [Streaming][Minor]Replace some 'if-else' in Clock
      6e03de30
    • [SPARK-2492][Streaming] kafkaReceiver minor changes to align with Kafka 0.8 · c8850a3d
      jerryshao authored
      Update the KafkaReceiver's behavior when auto.offset.reset is set.
      
      In Kafka 0.8, `auto.offset.reset` is a hint for out-of-range offsets: it tells the consumer to seek to the beginning or end of the partition. In the previous code, `auto.offset.reset` forced an immediate seek to the beginning or end, which differs from the behavior defined by Kafka 0.8.
      
      Also, deleting existing ZK metadata in the Receiver when multiple consumers are launched introduces the issue described in [SPARK-2383](https://issues.apache.org/jira/browse/SPARK-2383).
      
      So here we change the code to offer users an API to explicitly reset the offset before creating the Kafka stream, while keeping the same behavior as Kafka 0.8 for the `auto.offset.reset` parameter.
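
      For context, `auto.offset.reset` is supplied through the ordinary Kafka parameters when creating the stream; a rough sketch (hand-written; the topic, group id, and ZooKeeper address are made up; assumes an existing StreamingContext `ssc`):

      ```scala
      import kafka.serializer.StringDecoder
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.kafka.KafkaUtils

      // In Kafka 0.8, "smallest"/"largest" only apply when the requested offset
      // is out of range -- the behavior this change aligns with.
      val kafkaParams = Map(
        "zookeeper.connect" -> "localhost:2181",
        "group.id" -> "my-consumer-group",
        "auto.offset.reset" -> "smallest")

      val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
      ```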
      
      @tdas, would you please review this PR? Thanks a lot.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1420 from jerryshao/kafka-fix and squashes the following commits:
      
      d6ae94d [jerryshao] Address the comment to remove the resetOffset() function
      de3a4c8 [jerryshao] Fix compile error
      4a1c3f9 [jerryshao] Doc changes
      b2c1430 [jerryshao] Move offset reset to a helper function to let user explicitly delete ZK metadata by calling this API
      fac8fd6 [jerryshao] Changes to align with Kafka 0.8
      c8850a3d
    • [SPARK-4295][External]Fix exception in SparkSinkSuite · f8811a56
      maji2014 authored
      Handle the exception in SparkSinkSuite; please refer to [SPARK-4295].
      
      Author: maji2014 <maji3@asiainfo.com>
      
      Closes #3177 from maji2014/spark-4295 and squashes the following commits:
      
      312620a [maji2014] change a new statement for spark-4295
      24c3d21 [maji2014] add log4j.properties for SparkSinkSuite and spark-4295
      c807bf6 [maji2014] Fix exception in SparkSinkSuite
      f8811a56
    • [SPARK-4307] Initialize FileDescriptor lazily in FileRegion. · ef29a9a9
      Reynold Xin authored
      Netty's DefaultFileRegion requires a FileDescriptor in its constructor, which means we need an open file handle. In super large workloads, this could lead to too many open files due to the way these file descriptors are cleaned. This pull request creates a new LazyFileRegion that initializes the FileDescriptor only when we are sending data for the first time.
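
      The general idea, sketched outside of Netty's actual FileRegion interface (illustrative only; the class name is made up):

      ```scala
      import java.io.{File, RandomAccessFile}
      import java.nio.channels.{FileChannel, WritableByteChannel}

      // Nothing is opened at construction time; the underlying file descriptor
      // is created lazily on the first transfer.
      class LazyOpenFile(file: File) {
        private lazy val channel: FileChannel = new RandomAccessFile(file, "r").getChannel

        def transferTo(target: WritableByteChannel, position: Long): Long =
          channel.transferTo(position, channel.size() - position, target)
      }
      ```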
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #3172 from rxin/lazyFD and squashes the following commits:
      
      0bdcdc6 [Reynold Xin] Added reference to Netty's DefaultFileRegion
      d4564ae [Reynold Xin] Added SparkConf to the ctor argument of IndexShuffleBlockManager.
      6ed369e [Reynold Xin] Code review feedback.
      04cddc8 [Reynold Xin] [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
      ef29a9a9
    • [SPARK-4324] [PySpark] [MLlib] support numpy.array for all MLlib API · 65083e93
      Davies Liu authored
      This PR checks all of the existing Python MLlib APIs to make sure that numpy.array is supported as a Vector (and also as RDDs of numpy.array).
      
      It also improves some docstrings and doctests.
      
      cc mateiz mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3189 from davies/numpy and squashes the following commits:
      
      d5057c4 [Davies Liu] fix tests
      6987611 [Davies Liu] support numpy.array for all MLlib API
      65083e93
    • [SPARK-4330][Doc] Link to proper URL for YARN overview · 3c07b8f0
      Kousuke Saruta authored
      In running-on-yarn.md, there is a link to the YARN overview, but the URL points to the YARN alpha docs. It should point to the stable docs.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3196 from sarutak/SPARK-4330 and squashes the following commits:
      
      30baa21 [Kousuke Saruta] Fixed running-on-yarn.md to point proper URL for YARN
      3c07b8f0
  3. Nov 10, 2014
    • [SPARK-3649] Remove GraphX custom serializers · 300887bd
      Ankur Dave authored
      As [reported][1] on the mailing list, GraphX throws
      
      ```
      java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2
              at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39)
              at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
              at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329)
      ```
      
      when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption][2].
      
      GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would either require writing a tag byte (costly) or maintaining state in the serializer and assuming that serialization calls will alternate between key and value (fragile).
      
      Instead, this PR simply removes the custom serializers. This causes a **10% slowdown** (494 s to 543 s) and **16% increase in per-iteration communication** (2176 MB to 2518 MB) for PageRank (averages across 3 trials, 10 iterations per trial, uk-2007-05 graph, 16 r3.2xlarge nodes).
      
      [1]: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501
      [2]: https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #2503 from ankurdave/SPARK-3649 and squashes the following commits:
      
      a49c2ad [Ankur Dave] [SPARK-3649] Remove GraphX custom serializers
      300887bd
    • [SPARK-4274] [SQL] Fix NPE in printing the details of the query plan · c764d0ac
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3139 from chenghao-intel/comparison_test and squashes the following commits:
      
      f5d7146 [Cheng Hao] avoid exception in printing the codegen enabled
      c764d0ac
    • [SPARK-3954][Streaming] Optimization to FileInputDStream · ce6ed2ab
      surq authored
      When converting files to RDDs, there are three loops over the files sequence in the Spark source:
      1. files.map(...)
      2. files.zip(fileRDDs)
      3. files.size.foreach
      This is very time consuming when there are lots of files, so I made the following correction:
      three loops over the files sequence => only one loop
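
      The shape of the change, as a generic sketch (hand-written, not the actual FileInputDStream code; `files`, `ssc`, and `logInfo` are assumed to exist in the surrounding class):

      ```scala
      // Before (schematically): three separate traversals of `files`.
      //   val fileRDDs = files.map(file => ssc.sparkContext.textFile(file))
      //   val pairs    = files.zip(fileRDDs)
      //   files.foreach(file => logInfo(s"Created RDD for $file"))

      // After: a single pass that builds each RDD and logs as it goes.
      val fileRDDs = files.map { file =>
        logInfo(s"Created RDD for file $file")
        ssc.sparkContext.textFile(file)
      }
      ```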
      
      Author: surq <surq@asiainfo.com>
      
      Closes #2811 from surq/SPARK-3954 and squashes the following commits:
      
      321bbe8 [surq]  updated the code style.The style from [for...yield]to [files.map(file=>{})]
      88a2c20 [surq] Merge branch 'master' of https://github.com/apache/spark into SPARK-3954
      178066f [surq] modify code's style. [Exceeds 100 columns]
      626ef97 [surq] remove redundant import(ArrayBuffer)
      739341f [surq] promote the speed of convert files to RDDS
      ce6ed2ab
    • [SPARK-4149][SQL] ISO 8601 support for json date time strings · a1fc059b
      Daoyuan Wang authored
      This implements the feature davies mentioned in https://github.com/apache/spark/pull/2901#discussion-diff-19313312
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3012 from adrian-wang/iso8601 and squashes the following commits:
      
      50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support
      a1fc059b
    • [SPARK-4250] [SQL] Fix bug of constant null value mapping to ConstantObjectInspector · fa777833
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3114 from chenghao-intel/constant_null_oi and squashes the following commits:
      
      e603bda [Cheng Hao] fix the bug of null value for primitive types
      50a13ba [Cheng Hao] fix the timezone issue
      f54f369 [Cheng Hao] fix bug of constant null value for ObjectInspector
      fa777833
    • [SQL] remove a decimal case branch that has no effect at runtime · d793d80c
      Xiangrui Meng authored
      It generates warnings at compile time. marmbrus
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3192 from mengxr/dtc-decimal and squashes the following commits:
      
      955e9fb [Xiangrui Meng] remove a decimal case branch that has no effect
      d793d80c
    • [SPARK-4308][SQL] Sets SQL operation state to ERROR when exception is thrown · acb55aed
      Cheng Lian authored
      In `HiveThriftServer2`, when an exception is thrown during a SQL execution, the SQL operation state should be set to `ERROR`, but now it remains `RUNNING`. This affects the result of the `GetOperationStatus` Thrift API.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3175 from liancheng/fix-op-state and squashes the following commits:
      
      6d4c1fe [Cheng Lian] Sets SQL operation state to ERROR when exception is thrown
      acb55aed
    • [SPARK-4000][Build] Uploads HiveCompatibilitySuite logs · 534b2314
      Cheng Lian authored
      This is a follow up of #2845. In addition to unit-tests.log files, also upload failure output files generated by `HiveCompatibilitySuite` to Jenkins master. These files can be very helpful to debug Hive compatibility test failures.
      
      /cc pwendell marmbrus
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2993 from liancheng/upload-hive-compat-logs and squashes the following commits:
      
      8e6247f [Cheng Lian] Uploads HiveCompatibilitySuite logs
      534b2314
    • [SPARK-4319][SQL] Enable an ignored test "null count". · dbf10588
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3185 from ueshin/issues/SPARK-4319 and squashes the following commits:
      
      a44a38e [Takuya UESHIN] Enable an ignored test "null count".
      dbf10588
    • Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without... · 6e7a309b
      Patrick Wendell authored
      Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally."
      
      This reverts commit bd86cb17.
      6e7a309b
    • [SPARK-4047] - Generate runtime warnings for example implementation of PageRank · 974d334c
      Varadharajan Mukundan authored
      Based on SPARK-2434, this PR generates runtime warnings for example implementations (Python, Scala) of PageRank.
      
      Author: Varadharajan Mukundan <srinathsmn@gmail.com>
      
      Closes #2894 from varadharajan/SPARK-4047 and squashes the following commits:
      
      5f9406b [Varadharajan Mukundan] [SPARK-4047] - Point users to LogisticRegressionWithSGD and LogisticRegressionWithLBFGS instead of LogisticRegressionModel
      252f595 [Varadharajan Mukundan] a. Generate runtime warnings for
      05a018b [Varadharajan Mukundan] Fix PageRank implementation's package reference
      5c2bf54 [Varadharajan Mukundan] [SPARK-4047] - Generate runtime warnings for example implementation of PageRank
      974d334c
    • SPARK-1297 Upgrade HBase dependency to 0.98 · b32734e1
      tedyu authored
      pwendell rxin
      Please take a look
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #3115 from tedyu/master and squashes the following commits:
      
      2b079c8 [tedyu] SPARK-1297 Upgrade HBase dependency to 0.98
      b32734e1
    • SPARK-4230. Doc for spark.default.parallelism is incorrect · c6f4e704
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits:
      
      37a1d19 [Sandy Ryza] Clear up a couple things
      34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect
      c6f4e704