Commit fae4e2d6 authored by ksonj, committed by Reynold Xin

[SPARK-7035] Encourage __getitem__ over __getattr__ on column access in the Python DataFrame API

Author: ksonj <kson@siberie.de>

Closes #5971 from ksonj/doc and squashes the following commits:

dadfebb [ksonj] __getitem__ is cleaner than __getattr__
parent fa8fddff
@@ -139,7 +139,6 @@ DataFrames provide a domain-specific language for structured data manipulation i
Here we include some basic examples of structured data processing using DataFrames:
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
@@ -242,6 +241,12 @@ df.groupBy("age").count().show();
</div>
<div data-lang="python" markdown="1">
In Python it's possible to access a DataFrame's columns either by attribute
(`df.age`) or by indexing (`df['age']`). While the former is convenient for
interactive data exploration, users are strongly encouraged to use the
latter form, which is future-proof and won't break when a column name
is also an attribute on the DataFrame class.
{% highlight python %}
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
@@ -270,14 +275,14 @@ df.select("name").show()
## Justin
# Select everybody, but increment the age by 1
-df.select(df.name, df.age + 1).show()
+df.select(df['name'], df['age'] + 1).show()
## name (age + 1)
## Michael null
## Andy 31
## Justin 20
# Select people older than 21
-df.filter(df.age > 21).show()
+df.filter(df['age'] > 21).show()
## age name
## 30 Andy
......
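
The name clash that motivates this change can be sketched without Spark. The `ToyFrame` class below is a hypothetical stand-in written for illustration only; it is not the PySpark implementation, and its method bodies are assumptions. It only mirrors the general Python rule that `__getattr__` is consulted after normal attribute lookup fails, so a column that shares its name with a method (here `count`) is reachable by indexing but not by attribute.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not the PySpark implementation."""

    def __init__(self, columns):
        # Map of column name -> list of values.
        self._columns = dict(columns)

    def count(self):
        # A regular method whose name may collide with a column name.
        first = next(iter(self._columns.values()), [])
        return len(first)

    def __getitem__(self, name):
        # Indexing always looks the name up among the columns.
        return self._columns[name]

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup has failed,
        # so existing methods such as `count` are never routed through here.
        columns = self.__dict__.get("_columns", {})
        try:
            return columns[name]
        except KeyError:
            raise AttributeError(name)


df = ToyFrame({"age": [None, 30, 19], "count": [1, 2, 3]})

age_by_attr = df.age          # resolves to the "age" column via __getattr__
count_by_attr = df.count      # resolves to the bound method, not the column
count_by_index = df["count"]  # resolves to the "count" column
```

In this sketch `df.age` and `df["age"]` are interchangeable, while `df.count` silently yields the method and only `df["count"]` reaches the column, which is the case the indexing form guards against.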