  1. Feb 10, 2016
  2. Feb 09, 2016
      [SPARK-13149][SQL] Add FileStreamSource · b385ce38
      Shixiong Zhu authored
      `FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers the files in each batch and stores them in metadata files so that they can be recovered when restarting. The metadata files are stored in the file system. A follow-up PR will clean up the metadata files periodically.
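
      A minimal conceptual sketch of the bookkeeping idea (hypothetical names, not the actual implementation): each batch's files are recorded in a metadata log so a restarted source can recover exactly the same batches.

      ```scala
      import java.io.File

      // Hypothetical sketch: remember which files make up each batch and persist
      // that mapping; the real source writes it to metadata files on the file system.
      class SimpleFileSource(dir: File, metadataLog: collection.mutable.Map[Long, Seq[String]]) {
        private var seen = Set.empty[String]

        def nextBatch(batchId: Long): Seq[String] = {
          val newFiles = Option(dir.listFiles()).getOrElse(Array.empty[File])
            .map(_.getPath).filterNot(seen).toSeq
          seen ++= newFiles
          metadataLog(batchId) = newFiles // persisted, so a restart can replay the batch
          newFiles
        }
      }
      ```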
      
      This is based on the initial work from marmbrus.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11034 from zsxwing/stream-df-file-source.
      [SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter · 6f710f9f
      Takeshi YAMAMURO authored
      Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
      
      Current plan:
      ```
      == Optimized Logical Plan ==
      Project [col0#0,col1#1]
      +- Filter (col0#0 = xxx)
         +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
      
      == Physical Plan ==
      +- Filter (col0#0 = xxx)
         +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
      ```
      
      This patch enables the plan below:
      ```
      == Optimized Logical Plan ==
      Project [col0#0,col1#1]
      +- Filter (col0#0 = xxx)
         +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
      
      == Physical Plan ==
      Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
      ```
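
      The mechanism behind this is the `unhandledFilters` API on `BaseRelation`: a relation reports which pushed filters it cannot fully evaluate, and Spark only re-applies a `Filter` node for those. A hedged sketch of how a relation might implement it (simplified, not the actual JDBCRelation code):

      ```scala
      import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

      // Simplified sketch: report EqualTo filters as fully handled by the data source,
      // so the planner drops the redundant Filter node above the scan.
      trait HandlesEqualTo extends BaseRelation {
        override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
          filters.filterNot(_.isInstanceOf[EqualTo])
      }
      ```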
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #10427 from maropu/RemoveFilterInJdbcScan.
      [SPARK-10524][ML] Use the soft prediction to order categories' bins · 9267bc68
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10524
      
      Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins, but we should use the soft prediction instead.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #8734 from viirya/dt-soft-centroids.
      [SPARK-12950] [SQL] Improve lookup of BytesToBytesMap in aggregate · 0e5ebac3
      Davies Liu authored
      This PR improves the lookup of BytesToBytesMap by:
      
      1. Generating code to calculate the hash code of the grouping keys.
      
      2. Not using MemoryLocation; instead fetching the baseObject and offset for the key and value directly (removing the indirection).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11010 from davies/gen_map.
      [SPARK-13245][CORE] Call shuffleMetrics methods only in one thread for ShuffleBlockFetcherIterator · fae830d1
      Shixiong Zhu authored
      Call shuffleMetrics's incRemoteBytesRead and incRemoteBlocksFetched when polling FetchResult from `results`, so that shuffleMetrics is always used from a single thread.
      
      Also fix a race condition that could cause a memory leak.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11138 from zsxwing/SPARK-13245.
      [SPARK-12888] [SQL] [FOLLOW-UP] benchmark the new hash expression · 7fe4fe63
      Wenchen Fan authored
      Adds the benchmark results as comments.
      
      The codegen version is slower than the interpreted version for the `simple` case because of 3 reasons:
      
      1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
      2. The codegen version writes the hash value to a row first and then reads it back out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about a 60% speed up for the `simple` case; is it worth it?
      3. The row in the `simple` case only has one int field, so the runtime reflection may be removed by branch prediction, which makes the interpreted version faster.
      
      The `array` case is also slow for similar reasons, e.g. array elements are of the same type, so the interpreted version can probably get rid of runtime reflection through branch prediction.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10917 from cloud-fan/hash-benchmark.
      [SPARK-13189] Cleanup build references to Scala 2.10 · 2dbb9164
      Luciano Resende authored
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #11092 from lresende/SPARK-13189.
      [SPARK-12807][YARN] Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 · 34d0b70b
      Steve Loughran authored
      Patch to
      
      1. Shade Jackson 2.x in the spark-yarn-shuffle JAR: core, databind, annotation
      2. Use the Maven antrun plugin to verify the JAR contains the renamed classes
      
      Being Maven-based, I don't know if the verification phase kicks in on an SBT/Jenkins build. It will on a `mvn install`.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.
      [SPARK-13170][STREAMING] Investigate replacing SynchronizedQueue as it is deprecated · 68ed3632
      Sean Owen authored
      Replace SynchronizedQueue with synchronized access to a plain Queue.
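
      A hedged sketch of the replacement pattern (not the exact Spark code): guard a plain `mutable.Queue` with `synchronized` blocks instead of mixing in the deprecated `SynchronizedQueue` trait.

      ```scala
      import scala.collection.mutable

      // Instead of: new mutable.Queue[Int]() with mutable.SynchronizedQueue[Int]
      val queue = new mutable.Queue[Int]()

      def enqueue(x: Int): Unit = queue.synchronized { queue += x }

      // Drain all queued items atomically.
      def drain(): Seq[Int] = queue.synchronized {
        val items = queue.toList
        queue.clear()
        items
      }
      ```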
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11111 from srowen/SPARK-13170.
      [SPARK-13086][SHELL] Use the Scala REPL settings, to enable things like `-i file`. · e30121af
      Iulian Dragos authored
      Now:
      
      ```
      $ bin/spark-shell -i test.scala
      NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
      Setting default log level to "WARN".
      To adjust logging level use sc.setLogLevel(newLevel).
      16/01/29 17:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      16/01/29 17:37:39 INFO Main: Created spark context..
      Spark context available as sc (master = local[*], app id = local-1454085459000).
      16/01/29 17:37:39 INFO Main: Created sql context..
      SQL context available as sqlContext.
      Loading test.scala...
      hello
      
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
            /_/
      
      Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
      Type in expressions to have them evaluated.
      Type :help for more information.
      ```
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #10984 from dragos/issue/repl-eval-file.
      [SPARK-13177][EXAMPLES] Update ActorWordCount example to not directly use low... · d9ba4d27
      sachin aggarwal authored
      [SPARK-13177][EXAMPLES] Update ActorWordCount example to not directly use low level linked list as it is deprecated.
      
      Author: sachin aggarwal <different.sachin@gmail.com>
      
      Closes #11113 from agsachin/master.
      [SPARK-13040][DOCS] Update JDBC deprecated SPARK_CLASSPATH documentation · c882ec57
      Sebastián Ramírez authored
      Update the JDBC documentation based on http://stackoverflow.com/a/30947090/219530, as SPARK_CLASSPATH is deprecated.
      
      Also, that is how it actually worked; it didn't work with SPARK_CLASSPATH or --jars alone.
      
      This would solve issue: https://issues.apache.org/jira/browse/SPARK-13040
      
      Author: Sebastián Ramírez <tiangolo@gmail.com>
      
      Closes #10948 from tiangolo/patch-docs-jdbc.
      [SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFDataGenerator · ce83fe97
      Holden Karau authored
      KMeans:
      Make a private non-deprecated version of the setRuns API so that we can call it from the Python API without deprecation warnings in our own build. Also use it internally when called from train. Add a logWarning for non-1 values.
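
      A hedged sketch of the deprecation-avoidance pattern (hypothetical names, not the actual MLlib code):

      ```scala
      // Hypothetical sketch: keep the deprecated public setter but route internal
      // callers through a non-deprecated equivalent.
      class KMeansLike {
        private var runs: Int = 1

        @deprecated("Support for runs is deprecated", "1.6.0")
        def setRuns(r: Int): this.type = internalSetRuns(r)

        // In Spark this would be package-private and use logWarning; internal callers
        // (the Python API, train()) use it so our own build compiles without warnings.
        def internalSetRuns(r: Int): this.type = {
          if (r != 1) println(s"setRuns($r): values other than 1 have no effect")
          runs = r
          this
        }
      }
      ```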
      
      MFDataGenerator:
      Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere.
      
      I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
      [SPARK-13165][STREAMING] Replace deprecated synchronizedBuffer in streaming · 159198ef
      Holden Karau authored
      Building with Scala 2.11 produces the warning: "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative." We already use ConcurrentLinkedQueue elsewhere, so let's replace it.
      
      Some notes for reviewers about how the behaviour differs:
      A Seq implicitly converted from a SynchronizedBuffer would continue to receive updates; however, when we do the same conversion explicitly on the ConcurrentLinkedQueue this isn't the case. Hence some of the (internal & test) APIs are changed to pass an Iterable. toSeq is safe to use if there are no more updates.
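
      A hedged sketch of the replacement (not the exact test code): collect results in a `ConcurrentLinkedQueue` and only convert to a `Seq` once no further updates are expected.

      ```scala
      import java.util.concurrent.ConcurrentLinkedQueue
      import scala.collection.JavaConverters._

      // Instead of: new mutable.ArrayBuffer[Int]() with mutable.SynchronizedBuffer[Int]
      val results = new ConcurrentLinkedQueue[Int]()

      results.add(1) // safe from any thread
      results.add(2)

      // Unlike the old implicit view, this is a snapshot: call it only once
      // no more updates will arrive.
      val asSeq: Seq[Int] = results.asScala.toSeq
      ```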
      
      Author: Holden Karau <holden@us.ibm.com>
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #11067 from holdenk/SPARK-13165-replace-deprecated-synchronizedBuffer-in-streaming.
      [SPARK-13176][CORE] Use native file linking instead of external process ln · f9307d8f
      Jakob Odersky authored
      Since Spark requires at least JRE 1.7, it is safe to use the built-in java.nio.file.Files API.
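
      A hedged sketch of the JDK call (not the exact Spark utility code): create the link directly with `java.nio.file.Files` rather than forking an `ln` process.

      ```scala
      import java.nio.file.{Files, Paths}

      val src = Paths.get("/tmp/spark-source-file") // must already exist
      val dst = Paths.get("/tmp/spark-link")

      // Hard link, equivalent to `ln src dst`; use createSymbolicLink for `ln -s`.
      Files.createLink(dst, src)
      ```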
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #11098 from jodersky/SPARK-13176.
      [SPARK-12992] [SQL] Support vectorized decoding in UnsafeRowParquetRecordReader. · 3708d13f
      Nong Li authored
      WIP: running tests. Code needs a bit of clean up.
      
      This patch completes the vectorized decoding with the goal of passing the existing tests. There are still more patches needed to support the rest of the format spec, even just for flat schemas.
      
      This patch adds a new flag to enable the vectorized decoding. Tests were updated
      to try with both modes where applicable.
      
      Once this is working well, we can remove the previous code path.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11055 from nongli/spark-12992-2.
  3. Feb 08, 2016
      [SPARK-10620][SPARK-13054] Minor addendum to #10835 · eeaf45b9
      Andrew Or authored
      Additional changes to #10835, mainly related to style and visibility. This patch also adds back a few deprecated methods for backward compatibility.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10958 from andrewor14/task-metrics-to-accums-followups.
      [SPARK-13095] [SQL] improve performance for broadcast join with dimension table · ff0af0dd
      Davies Liu authored
      This PR improves the performance of broadcast joins with dimension tables, which are common in data warehouses.
      
      If the join key can fit in a long, we will use a special API `get(Long)` to get the rows from HashedRelation.
      
      If the HashedRelation only has unique keys, we will use a special API `getValue(Long)` or `getValue(InternalRow)`.
      
      If the keys fit within a long and are also dense, we will use an array of UnsafeRow instead of a hash map.
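
      A purely conceptual sketch of the dense-key idea (hypothetical class, not the HashedRelation code): when the long keys fall in a small, dense range, a lookup becomes one bounds check and one array access, with no hashing at all.

      ```scala
      // Conceptual sketch: keys in [minKey, maxKey] index directly into an array.
      class DenseLongMap[V >: Null <: AnyRef](minKey: Long, maxKey: Long) {
        private val values = new Array[AnyRef]((maxKey - minKey + 1).toInt)

        def put(key: Long, value: V): Unit = values((key - minKey).toInt) = value

        def getValue(key: Long): V =
          if (key < minKey || key > maxKey) null
          else values((key - minKey).toInt).asInstanceOf[V]
      }
      ```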
      
      TODO: will do cleanup
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11065 from davies/gen_dim.
      [SPARK-13210][SQL] catch OOM when allocate memory and expand array · 37bc203c
      Davies Liu authored
      There is a bug when we try to grow the buffer: the OOM is wrongly ignored (the assert is also skipped by the JVM), then we try to grow the array again, and this triggers spilling that frees the current page, so the record we just inserted becomes invalid.
      
      The root cause is that the JVM has less free memory than the MemoryManager thought, so it can OOM when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling.
      
      Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was only there for easy testing).
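
      A hedged sketch of the idea (hypothetical names, not the actual memory-manager code): catch the `OutOfMemoryError` thrown by page allocation, spill to free memory, then retry.

      ```scala
      // Hypothetical sketch: the JVM may have less free memory than the MemoryManager
      // believes, so allocation can throw OOM before spilling was ever triggered.
      def allocateWithSpill(alloc: Long => AnyRef, spill: () => Unit, size: Long): AnyRef = {
        try {
          alloc(size)
        } catch {
          case _: OutOfMemoryError =>
            spill()     // free pages held by this consumer
            alloc(size) // retry; if it fails again, the error propagates
        }
      }
      ```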
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11095 from davies/fix_expand.
      [SPARK-13101][SQL] nullability of array type element should not fail analysis of encoder · 8e4d15f7
      Wenchen Fan authored
      Nullability should only be considered an optimization rather than part of the type system, so instead of failing analysis for mismatched nullability, we should pass analysis and add a runtime null check.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11035 from cloud-fan/ignore-nullability.
      [SPARK-8964] [SQL] Use Exchange to perform shuffle in Limit · 06f0df6d
      Josh Rosen authored
      This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen.
      
      At a high level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit` declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`.
      
      In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast-path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic.
      
      In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similarly to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it.
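
      A purely illustrative sketch of the root-matching idea, using hypothetical case classes rather than the actual Catalyst types:

      ```scala
      // Illustrative only: plan a Limit differently when it sits at the root.
      sealed trait Plan
      case class Scan(name: String) extends Plan
      case class Limit(n: Int, child: Plan) extends Plan
      case class ReturnAnswer(child: Plan) extends Plan // marker inserted at the root

      def plan(p: Plan): String = p match {
        // Terminal limit: use the executeTake()-style fast path.
        case ReturnAnswer(Limit(n, child)) => s"CollectLimit($n, ${plan(child)})"
        // Limit in the middle of a plan: per-partition limit, shuffle, global limit.
        case Limit(n, child) => s"GlobalLimit($n, Exchange(LocalLimit($n, ${plan(child)})))"
        case ReturnAnswer(child) => plan(child)
        case Scan(name) => s"Scan($name)"
      }
      ```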
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7334 from JoshRosen/remove-copy-in-limit.
      [SPARK-12986][DOC] Fix pydoc warnings in mllib/regression.py · edf4a0e6
      Nam Pham authored
      I have fixed the warnings by running "make html" under "python/docs/". They are caused by not having blank lines around indented paragraphs.
      
      Author: Nam Pham <phamducnam@gmail.com>
      
      Closes #11025 from nampham2/SPARK-12986.
  4. Feb 07, 2016
      [SPARK-10963][STREAMING][KAFKA] make KafkaCluster public · 140ddef3
      cody koeninger authored
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #9007 from koeninger/SPARK-10963.
      [SPARK-13132][MLLIB] cache standardization param value in LogisticRegression · bc8890b3
      Gary King authored
      Cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-Newton optimizer.
      
      Also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit.
      
      This change improves training times for one of my test sets from ~7m30s to ~4m30s.
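
      A hedged sketch of the caching pattern (hypothetical names, not the actual LogisticRegression code): read the Param once before the optimization loop instead of going through the ParamMap on every step.

      ```scala
      // Hypothetical sketch of the pattern.
      class Trainer(params: Map[String, Any]) {
        private def standardization: Boolean =
          params.getOrElse("standardization", true).asInstanceOf[Boolean]

        def train(steps: Int): Unit = {
          val standardize = standardization // fetched once, reused every step
          for (_ <- 0 until steps) {
            if (standardize) { /* scale gradients by feature std */ }
          }
        }
      }
      ```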
      
      Author: Gary King <gary@idibon.com>
      
      Closes #11027 from idigary/spark-13132-optimize-logistic-regression.
  5. Feb 06, 2016
  6. Feb 05, 2016
      [SPARK-13171][CORE] Replace future calls with Future · 6883a512
      Jakob Odersky authored
      Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11. It also works with 2.10.
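
      A hedged sketch of the change: the lower-case `future { ... }` helper is deprecated in Scala 2.11, while `Future { ... }` (the companion object's `apply`) works in both 2.10 and 2.11.

      ```scala
      import scala.concurrent.Future
      import scala.concurrent.ExecutionContext.Implicits.global

      // Before (deprecated in 2.11): future { 21 * 2 }
      // After (works in 2.10 and 2.11):
      val result: Future[Int] = Future { 21 * 2 }
      ```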
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #11085 from jodersky/SPARK-13171.
      [SPARK-13215] [SQL] remove fallback in codegen · 875f5079
      Davies Liu authored
      Since we removed the configuration for codegen, we rely heavily on codegen (TungstenAggregate also requires the generated MutableProjection to update UnsafeRow), so we should remove the fallback, which could be confusing for users; see the discussion in SPARK-13116.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11097 from davies/remove_fallback.
      [SPARK-13002][MESOS] Send initial request of executors for dyn allocation · 0bb5b733
      Luc Bourlier authored
      Fix for [SPARK-13002](https://issues.apache.org/jira/browse/SPARK-13002) about the initial number of executors when running with dynamic allocation on Mesos.
      Instead of fixing it just for the Mesos case, the change is made in `ExecutorAllocationManager`. It already drives the number of executors running on Mesos, only not the initial value.
      
      The `None` and `Some(0)` are internal details of the computation of resources to reserve in the Mesos backend scheduler. `executorLimitOption` has to be initialized correctly, otherwise the Mesos backend scheduler will either create too many executors at launch, or not create any executors and not be able to recover from this state.
      
      Removed the 'special case' description in the doc. It was not totally accurate, and is not needed anymore.
      
      This doesn't fix the same problem visible with Spark standalone. There is no straightforward way to send the initial value in standalone mode.
      
      Somebody who knows this part of the YARN support should review this change.
      
      Author: Luc Bourlier <luc.bourlier@typesafe.com>
      
      Closes #11047 from skyluc/issue/initial-dyn-alloc-2.
      [SPARK-13214][DOCS] update dynamicAllocation documentation · 66e1383d
      Bill Chambers authored
      Author: Bill Chambers <bill@databricks.com>
      
      Closes #11094 from anabranch/dynamic-docs.
      [SPARK-12939][SQL] migrate encoder resolution logic to Analyzer · 1ed354a5
      Wenchen Fan authored
      https://issues.apache.org/jira/browse/SPARK-12939
      
      Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve the value deserializer (the `child.output` contains both the grouping key and values, which may mess things up if they have same-name attributes). End-to-end tests are added.
      
      follow-ups:
      
      * remove encoders from typed aggregate expression.
      * completely remove resolve/bind in `ExpressionEncoder`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10852 from cloud-fan/bug.
      [SPARK-13166][SQL] Rename DataStreamReaderWriterSuite to DataFrameReaderWriterSuite · 7b73f171
      Shixiong Zhu authored
      A follow-up PR for #11062, because it didn't rename the test suite.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11096 from zsxwing/rename.
      [SPARK-13187][SQL] Add boolean/long/double options in DataFrameReader/Writer · 82d84ff2
      Reynold Xin authored
      This patch adds option functions for boolean, long, and double types. This makes it slightly easier for Spark users to specify options without turning them into strings. Using the JSON data source as an example:
      
      Before this patch:
      ```scala
      sqlContext.read.option("primitivesAsString", "true").json("/path/to/json")
      ```
      
      After this patch:
      ```scala
      sqlContext.read.option("primitivesAsString", true).json("/path/to/json")
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11072 from rxin/SPARK-13187.
      [SPARK-13208][CORE] Replace use of Pairs with Tuple2s · 352102ed
      Jakob Odersky authored
      Another trivial deprecation fix for Scala 2.11
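
      A hedged sketch of the change: `Pair` (a `Predef` alias for `Tuple2`) is deprecated in Scala 2.11; tuple syntax or `Tuple2` is the drop-in replacement.

      ```scala
      // Before (deprecated in 2.11): val kv = Pair("answer", 42)
      // After:
      val kv1: (String, Int) = ("answer", 42)
      val kv2 = Tuple2("answer", 42) // equivalent explicit form
      ```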
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #11089 from jodersky/SPARK-13208.
  7. Feb 04, 2016