  1. Aug 03, 2015
    • Yanbo Liang's avatar
      [SPARK-9191] [ML] [Doc] Add ml.PCA user guide and code examples · 8ca287eb
      Yanbo Liang authored
      Add ml.PCA user guide document and code examples for Scala/Java/Python.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7522 from yanboliang/ml-pca-md and squashes the following commits:
      
      60dec05 [Yanbo Liang] address comments
      f992abe [Yanbo Liang] Add ml.PCA doc and examples
      8ca287eb
    • Kousuke Saruta's avatar
      [SPARK-9558][DOCS]Update docs to follow the increase of memory defaults. · ba1c4e13
      Kousuke Saruta authored
      Now the memory defaults of master and slave in Standalone mode and History Server are 1g, not 512m, so let's update the docs.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits:
      
      a77626c [Kousuke Saruta] Fix docs to follow the update of increase of memory defaults
      ba1c4e13
    • Joseph K. Bradley's avatar
      [SPARK-5133] [ML] Added featureImportance to RandomForestClassifier and Regressor · ff9169a0
      Joseph K. Bradley authored
      Added featureImportance to RandomForestClassifier and Regressor.
      
      This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]
      
      CC: yanboliang  Would you mind taking a look?  Thanks!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits:
      
      72a167a [Joseph K. Bradley] fixed unit test
      86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map
      5aa74f0 [Joseph K. Bradley] finally fixed unit test for real
      33df5db [Joseph K. Bradley] fix unit test
      42a2d3b [Joseph K. Bradley] fix unit test
      fe94e72 [Joseph K. Bradley] modified feature importance unit tests
      cc693ee [Feynman Liang] Add classifier tests
      79a6f87 [Feynman Liang] Compare dense vectors in test
      21d01fc [Feynman Liang] Added failing SKLearn test
      ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor.  Need to add unit tests
      ff9169a0
    • Cheng Lian's avatar
      [SPARK-9554] [SQL] Enables in-memory partition pruning by default · 703e44bf
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7895 from liancheng/spark-9554/enable-in-memory-partition-pruning and squashes the following commits:
      
      67c403e [Cheng Lian] Enables in-memory partition pruning by default
      703e44bf
    • Reynold Xin's avatar
      [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes. · 7a9d09f0
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7897 from rxin/calculateBitSetWidthInBytes and squashes the following commits:
      
      2e73b3a [Reynold Xin] [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes.
      7a9d09f0
    • Joseph Batchik's avatar
      [SPARK-9511] [SQL] Fixed Table Name Parsing · dfe7bd16
      Joseph Batchik authored
      The issue was that the tokenizer was parsing "1one" into the numeric 1 using the code on line 110. I added another case to accept strings that start with a number and contain a letter somewhere after it.
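
The distinction described here can be sketched in a few lines of Python. This is an illustration only, not Spark's parser: the function name `classify_token` and the regular expressions are assumptions made for the example.

```python
import re

# Hypothetical token classifier, for illustration only (not Spark's lexer).
_NUMERIC = re.compile(r"\d+")
# Starts with a digit but contains at least one letter or underscore later on.
_DIGIT_LED_IDENT = re.compile(r"\d[0-9A-Za-z_]*[A-Za-z_][0-9A-Za-z_]*")
# Ordinary identifier: starts with a letter or underscore.
_PLAIN_IDENT = re.compile(r"[A-Za-z_][0-9A-Za-z_]*")


def classify_token(text):
    """Return ("numeric", value) or ("identifier", name) for a single token."""
    if _NUMERIC.fullmatch(text):
        return ("numeric", int(text))
    if _DIGIT_LED_IDENT.fullmatch(text) or _PLAIN_IDENT.fullmatch(text):
        return ("identifier", text)
    raise ValueError("unrecognized token: %r" % text)
```

With the extra digit-led case, `classify_token("1one")` yields an identifier instead of being cut short at the leading `1`.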
      
      Author: Joseph Batchik <joseph.batchik@cloudera.com>
      
      Closes #7844 from JDrit/parse_error and squashes the following commits:
      
      b8ca12f [Joseph Batchik] fixed parsing issue by adding another case
      dfe7bd16
    • Andrew Or's avatar
      [SPARK-1855] Local checkpointing · b41a3271
      Andrew Or authored
      Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing it is through `rdd.checkpoint()`, which is expensive because it writes to HDFS. This patch provides an alternative to truncate lineages cheaply *without providing the same level of fault tolerance*.
      
      **Local checkpointing** writes checkpointed data to the local file system through the block manager. It is much faster than replicating to a reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator `rdd.localCheckpoint()` and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently.
      
      The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. [Design doc](https://issues.apache.org/jira/secure/attachment/12741708/SPARK-7292-design.pdf).
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7279 from andrewor14/local-checkpoint and squashes the following commits:
      
      729600f [Andrew Or] Oops, fix tests
      34bc059 [Andrew Or] Avoid computing all partitions in local checkpoint
      e43bbb6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      3be5aea [Andrew Or] Address comments
      bf846a6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      ab003a3 [Andrew Or] Fix compile
      c2e111b [Andrew Or] Address comments
      33f167a [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      e908a42 [Andrew Or] Fix tests
      f5be0f3 [Andrew Or] Use MEMORY_AND_DISK as the default local checkpoint level
      a92657d [Andrew Or] Update a few comments
      e58e3e3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      4eb6eb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      1bbe154 [Andrew Or] Simplify LocalCheckpointRDD
      48a9996 [Andrew Or] Avoid traversing dependency tree + rewrite tests
      62aba3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      db70dc2 [Andrew Or] Express local checkpointing through caching the original RDD
      87d43c6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      c449b38 [Andrew Or] Fix style
      4a182f3 [Andrew Or] Add fine-grained tests for local checkpointing
      53b363b [Andrew Or] Rename a few more awkwardly named methods (minor)
      e4cf071 [Andrew Or] Simplify LocalCheckpointRDD + docs + clean ups
      4880deb [Andrew Or] Fix style
      d096c67 [Andrew Or] Fix mima
      172cb66 [Andrew Or] Fix mima?
      e53d964 [Andrew Or] Fix style
      56831c5 [Andrew Or] Add a few warnings and clear exception messages
      2e59646 [Andrew Or] Add local checkpoint clean up tests
      4dbbab1 [Andrew Or] Refactor CheckpointSuite to test local checkpointing
      4514dc9 [Andrew Or] Clean local checkpoint files through RDD cleanups
      0477eec [Andrew Or] Rename a few methods with awkward names (minor)
      2e902e5 [Andrew Or] First implementation of local checkpointing
      8447454 [Andrew Or] Fix tests
      4ac1896 [Andrew Or] Refactor checkpoint interface for modularity
      b41a3271
    • Joseph K. Bradley's avatar
      [SPARK-9528] [ML] Changed RandomForestClassifier to extend ProbabilisticClassifier · 69f5a7c9
      Joseph K. Bradley authored
      RandomForestClassifier now outputs rawPrediction based on tree probabilities, plus probability column computed from normalized rawPrediction.
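
A rough sketch of the relationship between the two output columns, in plain Python. It assumes that rawPrediction is the element-wise sum of per-tree class-probability vectors; that aggregation rule and the function name `forest_predict` are assumptions for illustration, not taken from the patch.

```python
def forest_predict(tree_probabilities):
    """Combine per-tree class-probability vectors into (rawPrediction, probability).

    Assumption: rawPrediction is the element-wise sum over trees, and
    probability is rawPrediction rescaled to sum to 1.
    """
    num_classes = len(tree_probabilities[0])
    raw = [0.0] * num_classes
    for probs in tree_probabilities:
        for k, p in enumerate(probs):
            raw[k] += p
    total = sum(raw)
    # Guard against an all-zero vector; otherwise normalize.
    probability = [r / total for r in raw] if total > 0 else raw[:]
    return raw, probability
```

For two trees voting `[1.0, 0.0]` and `[0.5, 0.5]`, the raw prediction is `[1.5, 0.5]` and the normalized probability is `[0.75, 0.25]`.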
      
      CC: holdenk
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7859 from jkbradley/rf-prob and squashes the following commits:
      
      6c28f51 [Joseph K. Bradley] Changed RandomForestClassifier to extend ProbabilisticClassifier
      69f5a7c9
    • Reynold Xin's avatar
      8be198c8
    • Davies Liu's avatar
      [SPARK-9518] [SQL] cleanup generated UnsafeRowJoiner and fix bug · 191bf268
      Davies Liu authored
      Currently, when copying the bitsets, we do not consider that row1 may not sit at the beginning of the byte array.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7892 from davies/clean_join and squashes the following commits:
      
      14cce9e [Davies Liu] cleanup generated UnsafeRowJoiner and fix bug
      191bf268
    • Wenchen Fan's avatar
      [SPARK-9551][SQL] add a cheap version of copy for UnsafeRow to reuse a copy buffer · 137f4786
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7885 from cloud-fan/cheap-copy and squashes the following commits:
      
      0900ca1 [Wenchen Fan] replace == with ===
      73f4ada [Wenchen Fan] add tests
      07b865a [Wenchen Fan] add a cheap version of copy
      137f4786
    • Timothy Chen's avatar
      [SPARK-8873] [MESOS] Clean up shuffle files if external shuffle service is used · 95dccc63
      Timothy Chen authored
      This patch builds directly on #7820, which is largely written by tnachen. The only addition is one commit for cleaning up the code. There should be no functional differences between this and #7820.
      
      Author: Timothy Chen <tnachen@gmail.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7881 from andrewor14/tim-cleanup-mesos-shuffle and squashes the following commits:
      
      8894f7d [Andrew Or] Clean up code
      2a5fa10 [Andrew Or] Merge branch 'mesos_shuffle_clean' of github.com:tnachen/spark into tim-cleanup-mesos-shuffle
      fadff89 [Timothy Chen] Address comments.
      e4d0f1d [Timothy Chen] Clean up external shuffle data on driver exit with Mesos.
      95dccc63
    • Yin Huai's avatar
      [SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row · 1ebd41b1
      Yin Huai authored
      This PR adds a base aggregation iterator `AggregationIterator`, which is used to create `SortBasedAggregationIterator` (for sort-based aggregation) and `UnsafeHybridAggregationIterator`. The latter first tries hash-based aggregation and falls back to sort-based aggregation (using an external sorter) if we cannot allocate memory for the map. With these two iterators, the existing iterators are no longer needed, so I am removing them. Also, we can use a single physical `Aggregate` operator, which internally determines which iterator to use.
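
The hash-then-sort fallback can be illustrated with a small Python sketch. It is not the Spark implementation: the function name `hybrid_sum` is hypothetical, the `max_groups` cap stands in for "cannot allocate memory for the map", and an in-memory `sorted` call stands in for the external sorter.

```python
from itertools import chain, groupby
from operator import itemgetter


def hybrid_sum(rows, max_groups):
    """Sum values per key: hash-based first, sort-based once the map is 'full'.

    `max_groups` is a stand-in for running out of memory for the hash map.
    """
    table = {}
    it = iter(rows)
    for key, value in it:
        if key in table:
            table[key] += value
        elif len(table) < max_groups:
            table[key] = value
        else:
            # Fallback: sort the partial aggregates together with the current
            # row and the unread input, then merge adjacent rows by key.
            pending = chain(table.items(), [(key, value)], it)
            ordered = sorted(pending, key=itemgetter(0))
            return [(k, sum(v for _, v in group))
                    for k, group in groupby(ordered, key=itemgetter(0))]
    return sorted(table.items())
```

Both paths return the same result; only the amount of state held in the map differs. The partial aggregates already in the map are reused as ordinary input rows for the sort-based pass, which works here because summation is associative.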
      
      https://issues.apache.org/jira/browse/SPARK-9240
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7813 from yhuai/AggregateOperator and squashes the following commits:
      
      e317e2b [Yin Huai] Remove unnecessary change.
      74d93c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into AggregateOperator
      ba6afbc [Yin Huai] Add a little bit more comments.
      c9cf3b6 [Yin Huai] update
      0f1b06f [Yin Huai] Remove unnecessary code.
      21fd15f [Yin Huai] Remove unnecessary change.
      964f88b [Yin Huai] Implement fallback strategy.
      b1ea5cf [Yin Huai] wip
      7fcbd87 [Yin Huai] Add a flag to control what iterator to use.
      533d5b2 [Yin Huai] Prepare for fallback!
      33b7022 [Yin Huai] wip
      bd9282b [Yin Huai] UDAFs now supports UnsafeRow.
      f52ee53 [Yin Huai] wip
      3171f44 [Yin Huai] wip
      d2c45a0 [Yin Huai] wip
      f60cc83 [Yin Huai] Also check input schema.
      af32210 [Yin Huai] Check iter.hasNext before we create an iterator because the constructor of the iterator will read at least one row from a non-empty input iter.
      299008c [Yin Huai] First round cleanup.
      3915bac [Yin Huai] Create a base iterator class for aggregation iterators and add the initial version of the hybrid iterator.
      1ebd41b1
    • Yijie Shen's avatar
      [SPARK-9549][SQL] fix bugs in expressions · 98d6d9c7
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9549
      
      This PR fixes the following bugs:
      1.  `UnaryMinus`'s codegen version would fail to compile when the input is `Long.MinValue`.
      2.  `BinaryComparison` would fail to compile in codegen mode when comparing Boolean types.
      3.  `AddMonth` would fail if passed a huge negative month, which would lead to accessing a negative index of the `monthDays` array.
      4.  `Nanvl` with operands of different types.
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7882 from yjshen/minor_bug_fix and squashes the following commits:
      
      41bbd2c [Yijie Shen] fix bug in Nanvl type coercion
      3dee204 [Yijie Shen] address comments
      4fa5de0 [Yijie Shen] fix bugs in expressions
      98d6d9c7
    • Wenchen Fan's avatar
      [SPARK-9404][SPARK-9542][SQL] unsafe array data and map data · 608353c8
      Wenchen Fan authored
      This PR adds `UnsafeArrayData`. Currently we encode it this way:
      
      the first 4 bytes are the number of elements,
      then every 4 bytes hold the start offset of one element, unless the value is negative, in which case the element is null,
      followed by the elements themselves.
      
      an example:  [10, 11, 12, 13, null, 14] will be encoded as:
      6, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14
      
      Note that when we read an `UnsafeArrayData` from bytes, we can read the first 4 bytes as numElements and take the rest (with the first 4 bytes skipped) as the value region.
      
      Unsafe map data just uses 2 unsafe array data: the first 4 bytes are the number of elements, the second 4 bytes are the numBytes of the key array, followed by the key array data and the value array data.
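
The array layout can be checked with a short Python sketch. It models the buffer as a list of 4-byte words rather than raw bytes and assumes fixed-width 4-byte integer elements; the function names are hypothetical. The element count is written as 6 for the six-slot example array (the null slot included), since 4 bytes for the count plus six 4-byte offsets is what places the first element at offset 28.

```python
def encode_int_array(values):
    """Encode a list of ints (None = null) as a list of 4-byte words."""
    n = len(values)
    offset = 4 + 4 * n              # count word + one offset word per element
    offsets, data = [], []
    for v in values:
        if v is None:
            offsets.append(-offset)  # a negative offset marks a null element
        else:
            offsets.append(offset)
            data.append(v)
            offset += 4              # each non-null element occupies 4 bytes
    return [n] + offsets + data


def decode_int_array(words):
    """Inverse of encode_int_array: byte offsets are mapped back to word indexes."""
    n = words[0]
    out = []
    for i in range(n):
        off = words[1 + i]
        out.append(None if off < 0 else words[off // 4])
    return out
```

Encoding `[10, 11, 12, 13, None, 14]` gives `[6, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14]`, and decoding that list returns the original array.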
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7752 from cloud-fan/unsafe-array and squashes the following commits:
      
      3269bd7 [Wenchen Fan] fix a bug
      6445289 [Wenchen Fan] add unit tests
      49adf26 [Wenchen Fan] add unsafe map
      20d1039 [Wenchen Fan] add comments and unsafe converter
      821b8db [Wenchen Fan] add unsafe array
      608353c8
    • Yin Huai's avatar
      [SPARK-9372] [SQL] Filter nulls in join keys · 687c8c37
      Yin Huai authored
      This PR adds an optimization rule, `FilterNullsInJoinKey`, to add `Filter` before join operators to filter out rows having null values for join keys.
      
      This optimization is guarded by a new SQL conf, `spark.sql.advancedOptimization`.
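
Why the filter is safe for an inner equi-join can be shown with a small Python sketch. It is not the optimizer rule itself: rows are plain dictionaries, `None` stands for SQL NULL, and all function names are hypothetical.

```python
def sql_equal(a, b):
    """SQL-style equality: a comparison involving NULL (None) is never true."""
    return a is not None and b is not None and a == b


def inner_join(left, right, key):
    """Naive nested-loop inner equi-join on `key`."""
    return [(l, r) for l in left for r in right if sql_equal(l[key], r[key])]


def filter_null_keys(rows, key):
    """Drop rows whose join key is NULL; they can never match in an inner equi-join."""
    return [row for row in rows if row[key] is not None]
```

Joining the filtered inputs gives the same result as joining the originals, while fewer rows reach the join. The sketch covers inner joins only; outer joins preserve unmatched rows, so a filter of this kind cannot be applied blindly to the preserved side.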
      
      The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7768 from JoshRosen/filter-nulls-in-join-key and squashes the following commits:
      
      c02fc3f [Yin Huai] Address Josh's comments.
      0a8e096 [Yin Huai] Update comments.
      ea7d5a6 [Yin Huai] Make sure we do not keep adding filters.
      be88760 [Yin Huai] Make it clear that FilterNullsInJoinKeySuite.scala is used to test FilterNullsInJoinKey.
      8bb39ad [Yin Huai] Fix non-deterministic tests.
      303236b [Josh Rosen] Revert changes that are unrelated to null join key filtering
      40eeece [Josh Rosen] Merge remote-tracking branch 'origin/master' into filter-nulls-in-join-key
      c57a954 [Yin Huai] Bug fix.
      d3d2e64 [Yin Huai] First round of cleanup.
      f9516b0 [Yin Huai] Style
      c6667e7 [Yin Huai] Add PartitioningCollection.
      e616d3b [Yin Huai] wip
      7c2d2d8 [Yin Huai] Bug fix and refactoring.
      69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
      d5b84c3 [Yin Huai] Do not add unnecessary filters.
      2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.
      687c8c37
    • Yanbo Liang's avatar
      [SPARK-9536] [SPARK-9537] [SPARK-9538] [ML] [PYSPARK] ml.classification... · 4cdd8ecd
      Yanbo Liang authored
      [SPARK-9536] [SPARK-9537] [SPARK-9538] [ML] [PYSPARK] ml.classification support raw and probability prediction for PySpark
      
      Make the following ml.classification classes support raw and probability prediction for PySpark:
      ```scala
      NaiveBayesModel
      DecisionTreeClassifierModel
      LogisticRegressionModel
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7866 from yanboliang/spark-9536-9537 and squashes the following commits:
      
      2934dab [Yanbo Liang] ml.NaiveBayes, ml.DecisionTreeClassifier and ml.LogisticRegression support probability prediction
      4cdd8ecd
  2. Aug 02, 2015
    • Yin Huai's avatar
      [SPARK-2205] [SQL] Avoid unnecessary exchange operators in multi-way joins · 114ff926
      Yin Huai authored
      This PR adds `PartitioningCollection`, which is used to represent the `outputPartitioning` for SparkPlans with multiple children (e.g. `ShuffledHashJoin`). So, a `SparkPlan` can have multiple descriptions of its partitioning schemes. Taking `ShuffledHashJoin` as an example, it has two descriptions of its partitioning schemes, i.e. `left.outputPartitioning` and `right.outputPartitioning`. As a result, a query like `select * from t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)` will only have three Exchange operators (when shuffled joins are needed) instead of four.
      
      The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7773 from JoshRosen/multi-way-join-planning-improvements and squashes the following commits:
      
      5c45924 [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
      cd8269b [Josh Rosen] Refactor test to use SQLTestUtils
      2963857 [Yin Huai] Revert unnecessary SqlConf change.
      73913f7 [Yin Huai] Add comments and test. Also, revert the change in ShuffledHashOuterJoin for now.
      4a99204 [Josh Rosen] Delete unrelated expression change
      884ab95 [Josh Rosen] Carve out only SPARK-2205 changes.
      247e5fa [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
      c57a954 [Yin Huai] Bug fix.
      d3d2e64 [Yin Huai] First round of cleanup.
      f9516b0 [Yin Huai] Style
      c6667e7 [Yin Huai] Add PartitioningCollection.
      e616d3b [Yin Huai] wip
      7c2d2d8 [Yin Huai] Bug fix and refactoring.
      69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
      d5b84c3 [Yin Huai] Do not add unnecessary filters.
      2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.
      114ff926
    • Reynold Xin's avatar
      [SPARK-9546][SQL] Centralize orderable data type checking. · 30e89111
      Reynold Xin authored
      This pull request creates two isOrderable functions in RowOrdering that can be used to check whether a data type or a sequence of expressions can be used in sorting.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7880 from rxin/SPARK-9546 and squashes the following commits:
      
      f9e322d [Reynold Xin] Fixed tests.
      0439b43 [Reynold Xin] [SPARK-9546][SQL] Centralize orderable data type checking.
      30e89111
    • KaiXinXiaoLei's avatar
      [SPARK-9535][SQL][DOCS] Modify document for codegen. · 536d2adc
      KaiXinXiaoLei authored
      #7142 made codegen enabled by default so let's modify the corresponding documents.
      
      Closes #7142
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7863 from sarutak/SPARK-9535 and squashes the following commits:
      
      0884424 [Kousuke Saruta] Removed a line which mentioned about the effect of codegen enabled
      3c11af0 [Kousuke Saruta] Merge branch 'sqlconfig' of https://github.com/KaiXinXiaoLei/spark into SPARK-9535
      4ee531d [KaiXinXiaoLei] delete space
      4cfd11d [KaiXinXiaoLei] change spark.sql.planner.externalSort
      d624cf8 [KaiXinXiaoLei] sql config is wrong
      536d2adc
    • Reynold Xin's avatar
      [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter. · 9d03ad91
      Reynold Xin authored
      The detailed approach is documented in UnsafeKVExternalSorterSuite.testKVSorter(), working as follows:
      
      1. Create input by generating data randomly based on the given key/value schema (which is also randomly drawn from a list of candidate types)
      2. Run UnsafeKVExternalSorter on the generated data
      3. Collect the output from the sorter, and make sure the keys are sorted in ascending order
      4. Sort the input by both key and value, and sort the sorter output also by both key and value. Compare the sorted input and sorted output together to make sure all the key/values match.
      5. Check memory allocation to make sure there is no memory leak.
      
      There is also a spill flag. When set to true, the sorter will spill probabilistically roughly every 100 records.
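
Steps 1 to 4 can be sketched in Python as a generic randomized check. This is an illustration under assumptions, not the test from `UnsafeKVExternalSorterSuite`: the schema is fixed to integer key/value pairs, the memory-leak check of step 5 and the spill flag are left out, and `check_sorter` is a hypothetical name.

```python
import random


def check_sorter(sorter, seed, num_records=200):
    """Run one randomized round against `sorter`, a callable that takes a list
    of (key, value) pairs and returns them ordered by key."""
    rng = random.Random(seed)
    # Step 1: random input (the schema is fixed here rather than drawn at random).
    records = [(rng.randint(-50, 50), rng.randint(-50, 50))
               for _ in range(num_records)]
    # Step 2: run the sorter under test on a copy of the input.
    output = list(sorter(list(records)))
    # Step 3: keys must come out in ascending order.
    keys = [k for k, _ in output]
    assert all(a <= b for a, b in zip(keys, keys[1:])), "keys out of order"
    # Step 4: sort both sides by key and value, then compare pair by pair.
    assert sorted(records) == sorted(output), "records lost or altered"
    return True
```

Step 4 matters because step 3 alone would accept a sorter that drops or duplicates records as long as the surviving keys are ordered.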
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7873 from rxin/kvsorter-randomized-test and squashes the following commits:
      
      a08c251 [Reynold Xin] Resource cleanup.
      0488b5c [Reynold Xin] [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.
      9d03ad91
    • Liang-Chi Hsieh's avatar
      [SPARK-7937][SQL] Support comparison on StructType · 0722f433
      Liang-Chi Hsieh authored
      This brings #6519 up-to-date with master branch.
      
      Closes #6519.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7877 from rxin/sort-struct and squashes the following commits:
      
      4968231 [Reynold Xin] Minor fixes.
      2537813 [Reynold Xin] Merge branch 'compare_named_struct' of github.com:viirya/spark-1 into sort-struct
      d2ba8ad [Liang-Chi Hsieh] Remove unused import.
      3a3f40e [Liang-Chi Hsieh] Don't need to add compare to InternalRow because we can use RowOrdering.
      dae6aad [Liang-Chi Hsieh] Fix nested struct.
      d5349c7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      43d4354 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      1f66196 [Liang-Chi Hsieh] Reuse RowOrdering and GenerateOrdering.
      f8b2e9c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      1187a65 [Liang-Chi Hsieh] Fix scala style.
      9d67f68 [Liang-Chi Hsieh] Fix wrongly merging.
      8f4d775 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      94b27d5 [Liang-Chi Hsieh] Remove test for error on complex type comparison.
      2071693 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      3c142e4 [Liang-Chi Hsieh] Fix scala style.
      cf58dc3 [Liang-Chi Hsieh] Use checkAnswer.
      f651b8d [Liang-Chi Hsieh] Remove Either and move orderings to BinaryComparison to reuse it.
      b6e1009 [Liang-Chi Hsieh] Fix scala style.
      3922b54 [Liang-Chi Hsieh] Support ordering on named_struct.
      0722f433
    • Reynold Xin's avatar
      [SPARK-9531] [SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter · 2e981b7b
      Reynold Xin authored
      This pull request adds a destructAndCreateExternalSorter method to UnsafeFixedWidthAggregationMap. The new method does the following:
      
      1. Creates a new external sorter UnsafeKVExternalSorter
      2. Adds all the data into an in-memory sorter, sorts them
      3. Spills the sorted in-memory data to disk
      
      This method can be used to fall back to sort-based aggregation when under memory pressure.
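
The three steps can be pictured with a toy Python version that works on an ordinary dictionary. It is only a sketch under assumptions: JSON lines in a temporary file stand in for the spill format, and the names `spill_sorted` and `read_spill` are hypothetical.

```python
import json
import tempfile


def spill_sorted(aggregation_map):
    """Sort the map's entries by key, write them to a temp file, and clear the map.

    Returns the path of the spill file (one JSON [key, value] pair per line).
    """
    entries = sorted(aggregation_map.items())       # in-memory sort
    with tempfile.NamedTemporaryFile("w", suffix=".spill", delete=False) as f:
        for key, value in entries:                  # spill sorted data to disk
            f.write(json.dumps([key, value]) + "\n")
        path = f.name
    aggregation_map.clear()                         # "destruct": release the map
    return path


def read_spill(path):
    """Read the spilled (key, value) pairs back in sorted order."""
    with open(path) as f:
        return [tuple(json.loads(line)) for line in f]
```

After the call the map is empty and the sorted partial aggregates live on disk, where a sort-based pass can pick them up. The caller is responsible for deleting the spill file.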
      
      The pull request also includes accounting fixes from JoshRosen.
      
      TODOs (that can be done in follow-up PRs)
      - [x] Address Josh's feedback from #7849
      - [x] More documentation and test cases
      - [x] Make sure we are doing memory accounting correctly with test cases (e.g. did we release the memory in BytesToBytesMap twice?)
      - [ ] Look harder at possible memory leaks and exception handling
      - [ ] Randomized tester for the KV sorter as well as the aggregation map
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7860 from rxin/kvsorter and squashes the following commits:
      
      986a58c [Reynold Xin] Bug fix.
      599317c [Reynold Xin] Style fix and slightly more compact code.
      fe7bd4e [Reynold Xin] Bug fixes.
      fd71bef [Reynold Xin] Merge remote-tracking branch 'josh/large-records-in-sql-sorter' into kvsorter-with-josh-fix
      3efae38 [Reynold Xin] More fixes and documentation.
      45f1b09 [Josh Rosen] Ensure that spill files are cleaned up
      f6a9bd3 [Reynold Xin] Josh feedback.
      9be8139 [Reynold Xin] Remove testSpillFrequency.
      7cbe759 [Reynold Xin] [SPARK-9531][SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter.
      ae4a8af [Josh Rosen] Detect leaked unsafe memory in UnsafeExternalSorterSuite.
      52f9b06 [Josh Rosen] Detect ShuffleMemoryManager leaks in UnsafeExternalSorter.
      2e981b7b
    • Xiangrui Meng's avatar
      [SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly · 66924ffa
      Xiangrui Meng authored
      1. Use `PrefixSpanModel` to wrap the frequent sequences.
      2. Define `FreqSequence` to wrap each frequent sequence, which contains a Java-friendly method `javaSequence`
      3. Overload `run` for Java users.
      4. Added a unit test in Java to check Java compatibility.
      
      zhangjiajin feynmanliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7869 from mengxr/SPARK-9527 and squashes the following commits:
      
      4345594 [Xiangrui Meng] add PrefixSpanModel and make PrefixSpan Java friendly
      66924ffa
    • Reynold Xin's avatar
      [SPARK-9208][SQL] Sort DataFrame functions alphabetically. · 8eafa2ae
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7861 from rxin/api-audit and squashes the following commits:
      
      7200256 [Reynold Xin] [SPARK-9208][SQL] Sort DataFrame functions alphabetically.
      8eafa2ae
    • Yu ISHIKAWA's avatar
      [SPARK-9149] [ML] [EXAMPLES] Add an example of spark.ml KMeans · 244016a9
      Yu ISHIKAWA authored
      [SPARK-9149] Add an example of spark.ml KMeans - ASF JIRA https://issues.apache.org/jira/browse/SPARK-9149
      
      jkbradley Should we support other data formats, such as TSV or CSV? I have implemented these examples so that they support only space-separated files, which is the same as the example for `spark.mllib`'s `KMeans`.
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #7697 from yu-iskw/SPARK-9149 and squashes the following commits:
      
      7137bad [Yu ISHIKAWA] Fix the typo
      56b9da2 [Yu ISHIKAWA] Fix the place of the wrong import statement
      554e574 [Yu ISHIKAWA] Change the way to format input data in KMeansExample
      e7a948a [Yu ISHIKAWA] Import spark.ml.clustering.KMeans
      1901e0c [Yu ISHIKAWA] Change how to initialize an array for a DataFrame schema
      d8043f5 [Yu ISHIKAWA] Return a value directly
      d81bf55 [Yu ISHIKAWA] Fix a typo and its access specifiers
      3e0862d [Yu ISHIKAWA] Make KMeansExample more simple
      51ce9c1 [Yu ISHIKAWA] Make JavaKMeansExample more simple
      a5a01e0 [Yu ISHIKAWA] Fix a Javadoc about the command to execute the example
      b09ec13 [Yu ISHIKAWA] [SPARK-9149][ML][Examples] Add an example of spark.ml KMeans
      244016a9
    • Sean Owen's avatar
      [SPARK-9521] [BUILD] Require Maven 3.3.3+ in the build · 9d1c0252
      Sean Owen authored
      Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7852 from srowen/SPARK-9521 and squashes the following commits:
      
      3093039 [Sean Owen] Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
      9d1c0252
    • Davies Liu's avatar
      [SPARK-9529] [SQL] improve TungstenSort on DecimalType · 16b928c5
      Davies Liu authored
      Generate prefix for DecimalType, fix the random generator of decimal
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7857 from davies/sort_decimal and squashes the following commits:
      
      2433959 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_decimal
      de24253 [Davies Liu] fix style
      0a54c1a [Davies Liu] sort decimal
      16b928c5
    • Feynman Liang's avatar
      [SPARK-9000] [MLLIB] Support generic item types in PrefixSpan · 28d944e8
      Feynman Liang authored
      mengxr Please review after #7818 merges and master is rebased.
      
      Continues work by rikima
      
      Closes #7400
      
      Author: Feynman Liang <fliang@databricks.com>
      Author: masaki rikitoku <rikima3132@gmail.com>
      
      Closes #7837 from feynmanliang/SPARK-7400-genericItems and squashes the following commits:
      
      8b2c756 [Feynman Liang] Remove orig
      92443c8 [Feynman Liang] Style fixes
      42c6349 [Feynman Liang] Style fix
      14e67fc [Feynman Liang] Generic prefixSpan itemtypes
      b3b21e0 [Feynman Liang] Initial support for generic itemtype in public api
      b86e0d5 [masaki rikitoku] modify to support generic item type
      28d944e8
  3. Aug 01, 2015
    • Davies Liu's avatar
      [SPARK-9459] [SQL] use generated FromUnsafeProjection to do deep copy for UTF8String and struct · 57084e0c
      Davies Liu authored
      When accessing a column in an UnsafeRow, it is good to avoid the copy, but then we should do a deep copy when turning the UnsafeRow into a generic Row. This PR brings in a generated FromUnsafeProjection to do that.
      
      This PR also fixes the expressions that cache a UTF8String, which should also copy it.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7840 from davies/avoid_copy and squashes the following commits:
      
      230c8a1 [Davies Liu] address comment
      fd797c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into avoid_copy
      e095dd0 [Davies Liu] rollback rename
      8ef5b0b [Davies Liu] copy String in Columnar
      81360b8 [Davies Liu] fix class name
      9aecb88 [Davies Liu] use FromUnsafeProjection to do deep copy for UTF8String and struct
      57084e0c
    • Davies Liu's avatar
      [SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff,... · c1b0cbd7
      Davies Liu authored
      [SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff, to_utc_timestamp, from_utc_timestamp
      
      This PR is based on #7643, thanks to adrian-wang
      
      Author: Davies Liu <davies@databricks.com>
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #7847 from davies/datediff and squashes the following commits:
      
      74333d7 [Davies Liu] fix bug
      22d8a8c [Davies Liu] optimize
      85cdd21 [Davies Liu] remove unnecessary tests
      241d90c [Davies Liu] Merge branch 'master' of github.com:apache/spark into datediff
      e9dc0f5 [Davies Liu] fix datediff/to_utc_timestamp/from_utc_timestamp
      c360447 [Daoyuan Wang] function datediff, to_utc_timestamp, from_utc_timestamp (commits merged)
      c1b0cbd7
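As a rough illustration of what the three functions named in the commit above compute, here is a plain-Python sketch. It is an assumption-laden stand-in, not Spark's code: the helper names mirror the SQL function names, and a fixed UTC offset in hours (a simplification introduced here) replaces the time zone name that the SQL functions take.

```python
from datetime import date, datetime, timedelta, timezone


def datediff(end: date, start: date) -> int:
    """Number of days from start to end (negative if end is earlier)."""
    return (end - start).days


def from_utc_timestamp(ts: datetime, offset_hours: float) -> datetime:
    """Treat the naive timestamp as UTC and return the wall-clock time at the offset."""
    zone = timezone(timedelta(hours=offset_hours))
    return ts.replace(tzinfo=timezone.utc).astimezone(zone).replace(tzinfo=None)


def to_utc_timestamp(ts: datetime, offset_hours: float) -> datetime:
    """Treat the naive timestamp as wall-clock time at the offset and return UTC."""
    zone = timezone(timedelta(hours=offset_hours))
    return ts.replace(tzinfo=zone).astimezone(timezone.utc).replace(tzinfo=None)


days = datediff(date(2015, 8, 3), date(2015, 8, 1))
local = from_utc_timestamp(datetime(2015, 8, 1, 0, 0), 9)
utc = to_utc_timestamp(local, 9)
```

With a fixed offset the two timestamp helpers are exact inverses; with real time zones, daylight-saving transitions make the mapping less tidy.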
    • HuJiayin's avatar
      [SPARK-8269] [SQL] string function: initcap · 00cd92f3
      HuJiayin authored
      This PR is based on #7208, thanks to HuJiayin
      
      Closes #7208
      
      Author: HuJiayin <jiayin.hu@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7850 from davies/initcap and squashes the following commits:
      
      54472e9 [Davies Liu] fix python test
      17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      3a906e4 [Davies Liu] implement title case in UTF8String
      8b2506a [HuJiayin] Update functions.py
      2cd43e5 [HuJiayin] fix python style check
      b616c0e [HuJiayin] add python api
      1f5a0ef [HuJiayin] add codegen
      7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
      6a0b958 [HuJiayin] add column
      c79482d [HuJiayin] support soundex
      7ce416b [HuJiayin] support initcap rebase code
      00cd92f3
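For readers unfamiliar with `initcap`, the title-casing it performs can be sketched in plain Python. This assumes words are separated by single space characters and that the remainder of each word is lowercased; Spark's exact delimiter and case rules may differ, so treat it as an illustration only.

```python
def initcap(s):
    """Uppercase the first letter of each space-separated word, lowercase the rest."""
    if s is None:
        return None  # assumption: null in, null out
    # Splitting on a single space keeps runs of spaces intact when re-joined.
    return " ".join(w[:1].upper() + w[1:].lower() for w in s.split(" "))


result = initcap("sPark sql")
```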
    • Davies Liu's avatar
      [SPARK-9495] prefix of DateType/TimestampType · 5d9e33d9
      Davies Liu authored
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7856 from davies/sort_improve and squashes the following commits:
      
      5fc81bd [Davies Liu] support DateType/TimestampType
      5d9e33d9
    • Meihua Wu's avatar
      [SPARK-9530] [MLLIB] ScalaDoc should not indicate LDAModel.describeTopics and... · 84a6982b
      Meihua Wu authored
      [SPARK-9530] [MLLIB] ScalaDoc should not indicate LDAModel.describeTopics and DistributedLDAModel.topDocumentsPerTopic as approximate
      
      Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.
      
      cc jkbradley
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #7858 from rotationsymmetry/SPARK-9530 and squashes the following commits:
      
      b574923 [Meihua Wu] Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.
      84a6982b
    • Reynold Xin's avatar
      [SPARK-9520] [SQL] Support in-place sort in UnsafeFixedWidthAggregationMap · 3d1535d4
      Reynold Xin authored
      This pull request adds a sortedIterator method to UnsafeFixedWidthAggregationMap that sorts its data in-place by the grouping key.
      
      This is needed so we can fallback to external sorting for aggregation.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7849 from rxin/bytes2bytes-sorting and squashes the following commits:
      
      75018c6 [Reynold Xin] Updated documentation.
      81a8694 [Reynold Xin] [SPARK-9520][SQL] Support in-place sort in UnsafeFixedWidthAggregationMap.
      3d1535d4
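The in-place sort described in SPARK-9520 above can be pictured with a toy aggregation map. This is a minimal sketch, not the UnsafeFixedWidthAggregationMap implementation: the class, its method names, and the choice to invalidate the hash index after sorting are all assumptions made for illustration.

```python
class AggregationMap:
    """Toy hash aggregation map whose entries can be sorted in place by key."""

    def __init__(self):
        self._index = {}    # grouping key -> position in _entries
        self._entries = []  # [key, aggregation buffer] pairs

    def update(self, key, value):
        pos = self._index.get(key)
        if pos is None:
            self._index[key] = len(self._entries)
            self._entries.append([key, value])
        else:
            self._entries[pos][1] += value  # running sum as the aggregate

    def sorted_iterator(self):
        # list.sort reorders the existing entries without building a second list.
        self._entries.sort(key=lambda entry: entry[0])
        # Positions stored in the index are stale after sorting, so drop them.
        self._index.clear()
        return iter(self._entries)


m = AggregationMap()
for k, v in [("b", 1), ("a", 2), ("b", 3)]:
    m.update(k, v)
sorted_entries = list(m.sorted_iterator())
```

Once the entries are ordered by grouping key they can be merged with other sorted runs, which is what makes a fallback to external sorting possible.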
    • Marcelo Vanzin's avatar
      [SPARK-9491] Avoid fetching HBase tokens when not needed. · df733cbe
      Marcelo Vanzin authored
      Look at HBase's configuration to make sure it's configured for
      Kerberos. If the HBase configuration is missing, or if HBase is
      configured for non-kerberos authentication, then skip getting
      tokens.
      
      Reference: http://hbase.apache.org/book.html#security.prerequisites
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7810 from vanzin/SPARK-9491 and squashes the following commits:
      
      a57c776 [Marcelo Vanzin] [SPARK-9491] Avoid fetching HBase tokens when not needed.
      df733cbe
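The decision described in SPARK-9491 above reduces to a small predicate. The sketch below is not the code from the PR: the function name is hypothetical and the configuration is modeled as a plain dict. The property name `hbase.security.authentication` is the one documented in the HBase reference guide.

```python
def should_fetch_hbase_tokens(hbase_conf):
    """Return True only when an HBase configuration exists and asks for Kerberos."""
    if hbase_conf is None:
        # No HBase configuration available: nothing to authenticate against.
        return False
    return hbase_conf.get("hbase.security.authentication") == "kerberos"


secure = should_fetch_hbase_tokens({"hbase.security.authentication": "kerberos"})
simple = should_fetch_hbase_tokens({"hbase.security.authentication": "simple"})
missing = should_fetch_hbase_tokens(None)
```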
    • Andrew Or's avatar
      [SPARK-4751] Dynamic allocation in standalone mode · 6688ba6e
      Andrew Or authored
      Dynamic allocation is a feature that allows a Spark application to scale the number of executors up and down dynamically based on the workload. Support was first introduced for YARN in 1.2, and was recently extended to Mesos coarse-grained mode. Today, it is finally supported in standalone mode as well!
      
      I tested this locally and it works as expected. This is WIP because unit tests are coming.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7532 from andrewor14/standalone-da and squashes the following commits:
      
      b3c1736 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      879e928 [Andrew Or] Add end-to-end tests for standalone dynamic allocation
      accc8f6 [Andrew Or] Address comments
      ee686a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      c0a2c02 [Andrew Or] Fix build after merge conflict
      24149eb [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      2e762d6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      6832bd7 [Andrew Or] Add tests for scheduling with executor limit
      a82e907 [Andrew Or] Fix comments
      0a8be79 [Andrew Or] Simplify logic by removing the worker blacklist
      b7742af [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      2eb5f3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      1334e9a [Andrew Or] Fix MiMa
      32abe44 [Andrew Or] Fix style
      58cb06f [Andrew Or] Privatize worker blacklist for cleanliness
      42ac215 [Andrew Or] Clean up comments and rewrite code for readability
      49702d1 [Andrew Or] Clean up shuffle files after application exits
      80047aa [Andrew Or] First working implementation
      6688ba6e
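To make "scale the number of executors up and down based on the workload" concrete, here is a deliberately simplified sketch. It is not Spark's actual allocation policy: the function, its parameters, and the rule of sizing to the outstanding task count are assumptions chosen only to show the clamping between a minimum and a maximum.

```python
import math


def target_executors(pending_tasks, running_tasks, tasks_per_executor,
                     min_executors, max_executors):
    """Executors needed to run all outstanding tasks, clamped to [min, max]."""
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


busy = target_executors(10, 0, 4, min_executors=1, max_executors=8)
idle = target_executors(0, 0, 4, min_executors=1, max_executors=8)
overloaded = target_executors(100, 0, 4, min_executors=1, max_executors=8)
```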
    • zhichao.li's avatar
      [SPARK-8263] [SQL] substr/substring should also support binary type · c5166f7a
      zhichao.li authored
      This is based on #7641, thanks to zhichao-li
      
      Closes #7641
      
      Author: zhichao.li <zhichao.li@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7848 from davies/substr and squashes the following commits:
      
      461b709 [Davies Liu] remove bytearry from tests
      b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
      01d795e [zhichao.li] scala style
      99aa130 [zhichao.li] add substring to dataframe
      4f68bfe [zhichao.li] add binary type support for substring
      c5166f7a
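Supporting binary input in `substring` means the same position and length arithmetic applies to bytes as to characters. The following plain-Python sketch assumes a 1-based position, a negative position counting from the end, and position 0 treated like 1; these rules are assumptions for illustration and edge cases in Spark may differ.

```python
def substring(value, pos, length):
    """Slice a str or bytes value using a 1-based position and a length."""
    if value is None:
        return None
    if pos > 0:
        start = pos - 1                    # 1-based to 0-based
    elif pos < 0:
        start = max(len(value) + pos, 0)   # count from the end
    else:
        start = 0
    return value[start:start + max(length, 0)]


text = substring("Spark SQL", 1, 5)
binary = substring(b"Spark SQL", 7, 3)
tail = substring(b"\x01\x02\x03\x04", -2, 2)
```

Because Python slicing works the same way on `str` and `bytes`, one function covers both types here.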
    • Cheng Hao's avatar
      [SPARK-8232] [SQL] Add sort_array support · cf6c9ca3
      Cheng Hao authored
      This PR is based on #7581 and only fixes the conflict.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7851 from davies/sort_array and squashes the following commits:
      
      a80ef66 [Davies Liu] fix conflict
      7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array
      664c960 [Cheng Hao] update the sort_array by using the ArrayData
      276d2d5 [Cheng Hao] add empty line
      0edab9c [Cheng Hao] Add asending/descending support for sort_array
      80fc0f8 [Cheng Hao] Add type checking
      a42b678 [Cheng Hao] Add sort_array support
      cf6c9ca3
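The ascending/descending behavior added for `sort_array` can be sketched in plain Python. The placement of null elements (first when ascending, last when descending) is an assumption of this sketch rather than something stated in the commit message.

```python
def sort_array(values, ascending=True):
    """Sort a list; None elements go first when ascending, last when descending."""
    if values is None:
        return None
    nulls = [v for v in values if v is None]
    rest = sorted((v for v in values if v is not None), reverse=not ascending)
    return nulls + rest if ascending else rest + nulls


asc = sort_array([3, None, 1, 2])
desc = sort_array([3, None, 1, 2], ascending=False)
```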
    • Yuhao Yang's avatar
      [SPARK-8169] [ML] Add StopWordsRemover as a transformer · 87656650
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-8169
      
      stop words: http://en.wikipedia.org/wiki/Stop_words
      
      StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.
      
      Currently I use a minimal stop-word set, since in some [cases](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) a small set of stop words is preferred.
      ASCII chars have been tested, yet I cannot check it in due to the style check.
      
      Further thoughts:
      1. Maybe I should use OpenHashSet. Is it recommended?
      2. Currently I leave the null in input array untouched, i.e. Array(null, null) => Array(null, null).
      3. If the current stop words set looks too limited, any suggestion for replacement? We can have something similar to the one in [SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py).
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6742 from hhbyyh/stopwords and squashes the following commits:
      
      fa959d8 [Yuhao Yang] separating udf
      f190217 [Yuhao Yang] replace default list and other small fix
      04403ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into stopwords
      b3aa957 [Yuhao Yang] add stopWordsRemover
      87656650
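The behavior described for StopWordsRemover above (string array in, string array out, null elements left untouched) can be sketched without Spark. The function name, the tiny default word set, and the `case_sensitive` flag below are hypothetical and exist only for this illustration.

```python
# Illustrative only; not the default list shipped with the transformer.
DEFAULT_STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "and"})


def remove_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS, case_sensitive=False):
    """Drop stop words from a list of tokens, leaving None entries in place."""
    if case_sensitive:
        lookup = set(stop_words)
        return [t for t in tokens if t is None or t not in lookup]
    lookup = {w.lower() for w in stop_words}
    return [t for t in tokens if t is None or t.lower() not in lookup]


filtered = remove_stop_words(["The", "quick", "fox", "is", None, "red"])
nulls_kept = remove_stop_words([None, None])
```

Keeping `None` entries mirrors point 2 in the commit message, where `Array(null, null)` maps to `Array(null, null)`.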