  1. Nov 09, 2015
    • Burak Yavuz's avatar
      [SPARK-11198][STREAMING][KINESIS] Support de-aggregation of records during recovery · 26062d22
      Burak Yavuz authored
      While the KCL handles de-aggregation during regular operation, during recovery we use the lower-level API, and therefore we need to de-aggregate the records ourselves.
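
      A rough sketch of the idea, assuming the KCL's `UserRecord.deaggregate` helper is on the classpath (names here are illustrative, not the actual patch):

      ```scala
      import scala.collection.JavaConverters._

      import com.amazonaws.services.kinesis.clientlibrary.types.UserRecord
      import com.amazonaws.services.kinesis.model.Record

      // Expand possibly KPL-aggregated records fetched through the low-level
      // GetRecords API into individual user records.
      def deaggregate(records: java.util.List[Record]): Seq[Record] =
        UserRecord.deaggregate(records).asScala
      ```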
      
      tdas Testing is an issue: we need protobuf magic to create the aggregated records. Maybe we could depend on the KPL for tests?
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #9403 from brkyvz/kinesis-deaggregation.
      26062d22
    • Yuhao Yang's avatar
      [SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase · 61f9c871
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-11069
      quotes from jira:
      Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal:
      call the Boolean Param "toLowercase"
      set default to false (so behavior does not change)
      
      Actually sklearn converts to lowercase before tokenizing too
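
      A minimal usage sketch, assuming the option is exposed as a `setToLowercase` setter on `RegexTokenizer` (illustrative only):

      ```scala
      import org.apache.spark.ml.feature.RegexTokenizer

      // Opt in to lowercasing the text before the regex pattern is applied.
      val tokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("words")
        .setPattern("\\W+")
        .setToLowercase(true)
      ```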
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #9092 from hhbyyh/tokenLower.
      61f9c871
    • Yu ISHIKAWA's avatar
      [SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics in Python more specific · 7dc9d8db
      Yu ISHIKAWA authored
      cc jkbradley
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #9577 from yu-iskw/SPARK-11610.
      7dc9d8db
    • Reynold Xin's avatar
      [SPARK-11564][SQL] Fix documentation for DataFrame.take/collect · 675c7e72
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9557 from rxin/SPARK-11564-1.
      675c7e72
    • Michael Armbrust's avatar
      [SPARK-11578][SQL] User API for Typed Aggregation · 9c740a9d
      Michael Armbrust authored
      This PR adds a new interface for user-defined aggregations, that can be used in `DataFrame` and `Dataset` operations to take all of the elements of a group and reduce them to a single value.
      
      For example, the following aggregator extracts an `int` from a specific class and adds them up:
      
      ```scala
        case class Data(i: Int)
      
        val customSummer =  new Aggregator[Data, Int, Int] {
          def prepare(d: Data) = d.i
          def reduce(l: Int, r: Int) = l + r
          def present(r: Int) = r
        }.toColumn()
      
        val ds: Dataset[Data] = ...
        val aggregated = ds.select(customSummer)
      ```
      
      By using helper functions, users can make a generic `Aggregator` that works on any input type:
      
      ```scala
      /** An `Aggregator` that adds up any numeric type returned by the given function. */
      class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
        val numeric = implicitly[Numeric[N]]
        override def zero: N = numeric.zero
        override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
        override def present(reduction: N): N = reduction
      }
      
      def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new SumOf(f).toColumn
      ```
      
      These aggregators can then be used alongside other built-in SQL aggregations.
      
      ```scala
      val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
      ds
        .groupBy(_._1)
        .agg(
          sum(_._2),                // The aggregator defined above.
          expr("sum(_2)").as[Int],  // A built-in dynatically typed aggregation.
          count("*"))               // A built-in statically typed aggregation.
        .collect()
      
      res0: ("a", 30, 30, 2L), ("b", 3, 3, 2L), ("c", 1, 1, 1L)
      ```
      
      The current implementation focuses on integrating this into the typed API, but it only supports running aggregations that return a single long value, as explained in `TypedAggregateExpression`. This will be improved in a follow-up PR.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9555 from marmbrus/dataset-useragg.
      9c740a9d
    • gatorsmile's avatar
      [SPARK-11360][DOC] Loss of nullability when writing parquet files · 2f383788
      gatorsmile authored
      This fix adds one line to explain the current behavior of Spark SQL when writing Parquet files: all columns are forced to be nullable for compatibility reasons.
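
      A small sketch of the behavior being documented (assuming a `SQLContext` named `sqlContext` and a `SparkContext` named `sc`; the path and column name are made up):

      ```scala
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types._

      // A schema that declares the column as non-nullable...
      val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
      val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row(1), Row(2))), schema)
      df.write.parquet("/tmp/ids.parquet")

      // ...comes back nullable after a round trip through Parquet.
      sqlContext.read.parquet("/tmp/ids.parquet").printSchema()
      // root
      //  |-- id: integer (nullable = true)
      ```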
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #9314 from gatorsmile/lossNull.
      2f383788
    • hyukjinkwon's avatar
      [SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code · 9565c246
      hyukjinkwon authored
      Actually this was resolved by https://github.com/apache/spark/pull/8275.
      
      However, I found that the JIRA issue for this is not marked as resolved, since the PR above was opened for a different issue even though it resolved both.
      
      I commented there that this is resolved by the PR above; however, I opened this PR because I would like to add a few small corrections.
      
      In the previous PR, I refactored the test to only collect the filters instead of reducing them; however, this does not properly test the `And` filter (which is never passed to the tests). I unintentionally changed this from the original approach (before the refactoring).
      
      In this PR, I restored the original approach of collecting the filters by reducing them.
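
      For context, a minimal sketch of the difference, using the public data source `Filter` API (the filter values here are made up):

      ```scala
      import org.apache.spark.sql.sources.{And, EqualTo, Filter, GreaterThan}

      val filters: Seq[Filter] = Seq(EqualTo("a", 1), GreaterThan("b", 10))

      // Collecting keeps the predicates separate; reducing combines them, so a
      // test against the reduced predicate also exercises `And`.
      val combined: Option[Filter] = filters.reduceOption(And(_, _))
      // => Some(And(EqualTo("a", 1), GreaterThan("b", 10)))
      ```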
      
      I am happy to close this PR if it is inappropriate and somebody would rather deal with it in a separate, related PR.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #9554 from HyukjinKwon/SPARK-9557.
      9565c246
    • Wenchen Fan's avatar
      [SPARK-11564][SQL][FOLLOW-UP] improve java api for GroupedDataset · fcb57e9c
      Wenchen Fan authored
      created `MapGroupFunction`, `FlatMapGroupFunction`, `CoGroupFunction`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9564 from cloud-fan/map.
      fcb57e9c
    • Yu ISHIKAWA's avatar
      [SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering · 8a233689
      Yu ISHIKAWA authored
      I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs; I am going to send those in later PRs.
      https://issues.apache.org/jira/browse/SPARK-6517
      
      - This implementation is based on bisecting K-means clustering.
          - It derives from freeman-lab's implementation.
      - The basic idea is unchanged from the previous version (#2906).
          - However, it is 1000x faster than the previous version thanks to parallel processing.
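
      A hypothetical usage sketch of the resulting API (class and method names are assumptions based on this description, not a confirmed interface; `sc` is an existing SparkContext):

      ```scala
      import org.apache.spark.mllib.clustering.BisectingKMeans
      import org.apache.spark.mllib.linalg.Vectors

      // Assumed API: bisecting k-means over an RDD of feature vectors.
      val data = sc.parallelize(Seq(
        Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
        Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

      val model = new BisectingKMeans().setK(2).run(data)
      model.clusterCenters.foreach(println)
      ```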
      
      Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr), and Sean Owen (srowen).
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
      
      Closes #5267 from yu-iskw/new-hierarchical-clustering.
      8a233689
    • Burak Yavuz's avatar
      [SPARK-11359][STREAMING][KINESIS] Checkpoint to DynamoDB even when new data doesn't come in · a3a7c910
      Burak Yavuz authored
      Currently, the checkpoints to DynamoDB occur only when new data comes in, as we update the clock for the checkpointState. This PR makes the checkpoint a scheduled execution based on the `checkpointInterval`.
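
      A generic sketch of the scheduling pattern described above, using plain JDK scheduling (illustrative, not the actual Kinesis receiver code):

      ```scala
      import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

      import org.apache.spark.streaming.Duration

      // Fire the checkpoint task every checkpointInterval, regardless of
      // whether new records have arrived.
      def startCheckpointer(checkpointInterval: Duration)(
          checkpoint: () => Unit): ScheduledExecutorService = {
        val executor = Executors.newSingleThreadScheduledExecutor()
        executor.scheduleAtFixedRate(
          new Runnable { override def run(): Unit = checkpoint() },
          checkpointInterval.milliseconds,
          checkpointInterval.milliseconds,
          TimeUnit.MILLISECONDS)
        executor
      }
      ```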
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #9421 from brkyvz/kinesis-checkpoint.
      a3a7c910
    • Cheng Lian's avatar
      [SPARK-11595] [SQL] Fixes ADD JAR when the input path contains URL scheme · 150f6a89
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9569 from liancheng/spark-11595.fix-add-jar.
      150f6a89
    • Nick Buroojy's avatar
      [SPARK-9301][SQL] Add collect_set and collect_list aggregate functions · f138cb87
      Nick Buroojy authored
      For now they are thin wrappers around the corresponding Hive UDAFs.
      
      One limitation with these in Hive 0.13.0 is they only support aggregating primitive types.
      
      I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns.
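
      Example usage from the Scala DataFrame API (column names are made up):

      ```scala
      import org.apache.spark.sql.functions.{collect_list, collect_set}
      import sqlContext.implicits._  // assumes a SQLContext named sqlContext (e.g. spark-shell)

      val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

      // collect_list keeps duplicates within each group; collect_set drops them.
      df.groupBy("key")
        .agg(collect_list("value"), collect_set("value"))
        .show()
      ```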
      
      Do we also want to add these to `functions.py`?
      
      This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089
      
      
      
      marmbrus rxin
      
      Author: Nick Buroojy <nick.buroojy@civitaslearning.com>
      
      Closes #9526 from nburoojy/nick/udaf-alias.
      
      (cherry picked from commit a6ee4f98)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      f138cb87
    • Rishabh Bhardwaj's avatar
      [SPARK-11548][DOCS] Replaced example code in mllib-collaborative-filtering.md using include_example · b7720fa4
      Rishabh Bhardwaj authored
      Kindly review the changes.
      
      Author: Rishabh Bhardwaj <rbnext29@gmail.com>
      
      Closes #9519 from rishabhbhardwaj/SPARK-11337.
      b7720fa4
    • sachin aggarwal's avatar
      [SPARK-11552][DOCS] Replaced example code in ml-decision-tree.md using include_example · 51d41e4b
      sachin aggarwal authored
      I have tested it locally and it works fine; please review.
      
      Author: sachin aggarwal <different.sachin@gmail.com>
      
      Closes #9539 from agsachin/SPARK-11552-real.
      51d41e4b
    • Felix Bechstein's avatar
      [SPARK-10471][CORE][MESOS] prevent getting offers for unmet constraints · 5039a49b
      Felix Bechstein authored
      This change rejects offers from slaves with unmet constraints for 120s to mitigate offer starvation.
      This prevents Mesos from sending us these offers again and again.
      In return, we get more offers for slaves that might meet our constraints.
      It also enables Mesos to send the rejected offers to other frameworks.
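
      Roughly, the decline pattern looks like this with the Mesos scheduler API (a sketch under that assumption, not the actual Spark scheduler code):

      ```scala
      import org.apache.mesos.Protos.{Filters, Offer}
      import org.apache.mesos.SchedulerDriver

      // Decline an offer that fails our constraints and ask Mesos not to
      // re-offer these resources for 120 seconds.
      def declineForConstraintMismatch(driver: SchedulerDriver, offer: Offer): Unit = {
        val filters = Filters.newBuilder().setRefuseSeconds(120.0).build()
        driver.declineOffer(offer.getId, filters)
      }
      ```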
      
      Author: Felix Bechstein <felix.bechstein@otto.de>
      
      Closes #8639 from felixb/decline_offers_constraint_mismatch.
      5039a49b
    • Yu ISHIKAWA's avatar
      [SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification · 88a3fdcc
      Yu ISHIKAWA authored
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8690 from yu-iskw/SPARK-10280.
      88a3fdcc
    • Bharat Lal's avatar
      [SPARK-11581][DOCS] Example mllib code in documentation incorrectly computes MSE · 860ea0d3
      Bharat Lal authored
      Author: Bharat Lal <bharat.iisc@gmail.com>
      
      Closes #9560 from bharatl/SPARK-11581.
      860ea0d3
    • chriskang90's avatar
      [DOCS] Fix typo for Python section on unifying Kafka streams · 874cd66d
      chriskang90 authored
      1) kafkaStreams is a list.  The list should be unpacked when passing it into the streaming context union method, which accepts a variable number of streams.
      2) print() should be pprint() for pyspark.
      
      This contribution is my original work, and I license the work to the project under the project's open source license.
      
      Author: chriskang90 <jckang@uchicago.edu>
      
      Closes #9545 from c-kang/streaming_python_typo.
      874cd66d
    • felixcheung's avatar
      [SPARK-9865][SPARKR] Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame · cd174882
      felixcheung authored
      Make sample test less flaky by setting the seed
      
      Tested with
      ```
      repeat {  if (count(sample(df, FALSE, 0.1)) == 3) { break } }
      ```
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9549 from felixcheung/rsample.
      cd174882
    • tedyu's avatar
      [SPARK-11112] Fix Scala 2.11 compilation error in RDDInfo.scala · 404a28f4
      tedyu authored
      As shown in https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1946/console , compilation fails with:
      ```
      [error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/storage/RDDInfo.scala:25: in class RDDInfo, multiple overloaded alternatives of constructor RDDInfo define default arguments.
      [error] class RDDInfo(
      [error]
      ```
      This PR fixes the compilation error.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #9538 from tedyu/master.
      404a28f4
    • Charles Yeh's avatar
      [SPARK-10565][CORE] add missing web UI stats to /api/v1/applications JSON · 08a7a836
      Charles Yeh authored
      I looked at the other endpoints, and they don't seem to be missing any fields.
      Added fields:
      ![image](https://cloud.githubusercontent.com/assets/613879/10948801/58159982-82e4-11e5-86dc-62da201af910.png)
      
      Author: Charles Yeh <charlesyeh@dropbox.com>
      
      Closes #9472 from CharlesYeh/api_vars.
      08a7a836
    • fazlan-nazeem's avatar
      [SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model · 9b88e1dc
      fazlan-nazeem authored
      The PMML models currently generated do not specify the PMML version in their root node. This is a problem when using these models in other tools, because they expect the version attribute to be set explicitly. This fix adds the PMML version attribute to the generated models and sets its value to 4.2.
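
      As a hypothetical check of the result (any PMML-exportable MLlib model works; `sc` is an existing SparkContext):

      ```scala
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      // Train a small model and inspect the exported XML root.
      val points = sc.parallelize(Seq(Vectors.dense(1.0), Vectors.dense(9.0)))
      val model = KMeans.train(points, k = 2, maxIterations = 10)
      assert(model.toPMML().contains("version=\"4.2\""))
      ```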
      
      Author: fazlan-nazeem <fazlann@wso2.com>
      
      Closes #9558 from fazlan-nazeem/master.
      9b88e1dc
    • Yanbo Liang's avatar
      [SPARK-10689][ML][DOC] User guide and example code for AFTSurvivalRegression · d50a66cc
      Yanbo Liang authored
      Add user guide and example code for ```AFTSurvivalRegression```.
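
      For reference, a minimal usage sketch of the estimator being documented (the column names are the conventional defaults and the data is made up):

      ```scala
      import org.apache.spark.ml.regression.AFTSurvivalRegression
      import org.apache.spark.mllib.linalg.Vectors
      import sqlContext.implicits._  // assumes a SQLContext named sqlContext

      val training = Seq(
        (1.218, 1.0, Vectors.dense(1.560, -0.605)),
        (2.949, 0.0, Vectors.dense(0.346, 2.158)),
        (3.627, 0.0, Vectors.dense(1.380, 0.231)),
        (0.273, 1.0, Vectors.dense(0.520, 1.151))
      ).toDF("label", "censor", "features")

      val aft = new AFTSurvivalRegression()
        .setQuantileProbabilities(Array(0.3, 0.6))
        .setQuantilesCol("quantiles")

      val model = aft.fit(training)
      model.transform(training).show(false)
      ```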
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9491 from yanboliang/spark-10689.
      d50a66cc
    • Yanbo Liang's avatar
      [SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression · 8c0e1b50
      Yanbo Liang authored
      Expose R-like summary statistics in SparkR::glm for linear regression; the output of ```summary``` looks like:
      ```
      $DevianceResiduals
       Min        Max
       -0.9509607 0.7291832
      
      $Coefficients
                         Estimate   Std. Error t value   Pr(>|t|)
      (Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
      Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
      Species_versicolor -0.9833885 0.07207471 -13.64402 0
      Species_virginica  -1.00751   0.09330565 -10.79796 0
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9561 from yanboliang/spark-11494.
      8c0e1b50
    • Rohit Agarwal's avatar
      [DOC][MINOR][SQL] Fix internal link · b541b316
      Rohit Agarwal authored
      It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change.
      
      Author: Rohit Agarwal <mindprince@gmail.com>
      
      Closes #9544 from mindprince/patch-2.
      b541b316
    • Charles Yeh's avatar
      [SPARK-11218][CORE] show help messages for start-slave and start-master · 9e48cdfb
      Charles Yeh authored
      Addressing https://issues.apache.org/jira/browse/SPARK-11218; this mostly copies start-thriftserver.sh.
      ```
      charlesyeh-mbp:spark charlesyeh$ ./sbin/start-master.sh --help
      Usage: Master [options]
      
      Options:
        -i HOST, --ip HOST     Hostname to listen on (deprecated, please use --host or -h)
        -h HOST, --host HOST   Hostname to listen on
        -p PORT, --port PORT   Port to listen on (default: 7077)
        --webui-port PORT      Port for web UI (default: 8080)
        --properties-file FILE Path to a custom Spark properties file.
                               Default is conf/spark-defaults.conf.
      ```
      ```
      charlesyeh-mbp:spark charlesyeh$ ./sbin/start-slave.sh
      Usage: Worker [options] <master>
      
      Master must be a URL of the form spark://hostname:port
      
      Options:
        -c CORES, --cores CORES  Number of cores to use
        -m MEM, --memory MEM     Amount of memory to use (e.g. 1000M, 2G)
        -d DIR, --work-dir DIR   Directory to run apps in (default: SPARK_HOME/work)
        -i HOST, --ip IP         Hostname to listen on (deprecated, please use --host or -h)
        -h HOST, --host HOST     Hostname to listen on
        -p PORT, --port PORT     Port to listen on (default: random)
        --webui-port PORT        Port for web UI (default: 8081)
        --properties-file FILE   Path to a custom Spark properties file.
                                 Default is conf/spark-defaults.conf.
      ```
      
      Author: Charles Yeh <charlesyeh@dropbox.com>
      
      Closes #9432 from CharlesYeh/helpmsg.
      9e48cdfb
  2. Nov 08, 2015
  3. Nov 07, 2015
  4. Nov 06, 2015
    • Andrew Or's avatar
      [SPARK-11112] DAG visualization: display RDD callsite · 7f741905
      Andrew Or authored
      <img width="548" alt="screen shot 2015-11-01 at 9 42 33 am" src="https://cloud.githubusercontent.com/assets/2133137/10870343/2a8cd070-807d-11e5-857a-4ebcace77b5b.png">
      mateiz sarutak
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9398 from andrewor14/rdd-callsite.
      7f741905
    • Josh Rosen's avatar
      [SPARK-11389][CORE] Add support for off-heap memory to MemoryManager · 30b706b7
      Josh Rosen authored
      In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory.
      
      ## User-facing changes
      
      This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit.
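
      For illustration only (the configuration name is noted above as subject to change, and whether it accepts size suffixes like `2g` is an assumption), the setting would be supplied like any other Spark conf:

      ```scala
      import org.apache.spark.SparkConf

      // Cap the off-heap memory Spark may use for execution data pages.
      val conf = new SparkConf()
        .set("spark.memory.offHeapSize", "2g")
      ```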
      
      ## Internals changes
      
      This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies.
      
      There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances).
      
      I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take a look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution.
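
      To make the shape of that abstraction concrete, here is a minimal illustrative sketch (not the actual Spark classes):

      ```scala
      // Illustrative only: a pool tracks its own size and usage, and the manager
      // grows or shrinks pools by moving capacity between them.
      abstract class MemoryPool {
        private var _poolSize: Long = 0L
        def poolSize: Long = _poolSize
        def memoryUsed: Long
        def memoryFree: Long = _poolSize - memoryUsed
        def incrementPoolSize(delta: Long): Unit = { _poolSize += delta }
        def decrementPoolSize(delta: Long): Unit = { _poolSize -= delta }
      }
      ```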
      
      ## TODOs
      
      - [x] Fix handful of test failures in the MemoryManagerSuites.
      - [x] Fix remaining TODO comments in code.
      - [ ] Document new configuration.
      - [x] Fix commented-out tests / asserts:
        - [x] UnifiedMemoryManagerSuite.
      - [x] Write tests that exercise the new off-heap memory management policies.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9344 from JoshRosen/offheap-memory-accounting.
      30b706b7
    • Michael Armbrust's avatar
      [HOTFIX] Fix python tests after #9527 · 105732dc
      Michael Armbrust authored
      #9527 missed updating the python tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9533 from marmbrus/hotfixTextValue.
      105732dc
    • navis.ryu's avatar
      [SPARK-11546] Thrift server makes too many logs about result schema · 1c80d66e
      navis.ryu authored
      SparkExecuteStatementOperation logs the result schema on every getNextRowSet() call, which by default happens every 1000 rows, overwhelming the whole log file.
      
      Author: navis.ryu <navis@apache.org>
      
      Closes #9514 from navis/SPARK-11546.
      1c80d66e